diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index c72a5749c52ac97bca71c672ef5295d303d22b05..f9ba8cf65f3e3104dd061c178066ec8247811f33 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,9 +1,9 @@
-# Microsoft Open Source Code of Conduct
-
-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-
-Resources:
-
-- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
-- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
-- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
+# Microsoft Open Source Code of Conduct
+
+This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+
+Resources:
+
+- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
+- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
+- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 70f39bd7f81f6b8fcedbb68778cbc3caf6c2e6c4..f6e5f39869ebfa495b02328c168f894a5f1620cd 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -65,3 +65,31 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
comments.
+
+## New Feature Contribution Guidelines
+Unlike a bug fix or an improvement to an existing feature (where you usually submit a PR directly and we review it), adding a new feature to DeepSpeed requires several steps: (1) proposal and discussion, (2) implementation and verification, (3) release and maintenance. This guideline applies to all new feature contributions; core DeepSpeed team members may complete step 1 internally for their own contributions.
+
+### Step 1: proposal and discussion
+We ask that you first post your intended feature in an issue. This issue needs to include:
+
+* A description of the proposed feature.
+* An explanation of why it will be useful to DeepSpeed users.
+* A rough design of how you plan to implement the feature inside DeepSpeed.
+* (Important) Results or planned experiments to demonstrate the effectiveness and correctness of the feature.
+ * If this is a general feature applicable to different tasks, we require testing it on at least one CV task (e.g., [CIFAR](https://www.deepspeed.ai/tutorials/cifar-10/)) and one NLP task (e.g., [SQuAD](https://www.deepspeed.ai/tutorials/bert-finetuning/)). If this is a feature for one kind of task only, it is fine to just test on the specific task.
+  * If the feature only affects performance and does not affect training convergence, we require testing on a fraction of training to demonstrate that the training/validation losses are consistent with the baseline, and that the performance is better than the baseline.
+  * If the feature does affect training convergence, we require testing the whole training run to demonstrate that the feature achieves better or on-par final model quality and training performance compared to the baseline.
+
+Based on the issue discussion, we will evaluate the merit of the new feature and decide whether to accept or decline the proposal. Once the proposal is accepted and the design and implementation plan are confirmed, we are ready for step 2.
+
+### Step 2: implementation and verification
+The contributor then implements the feature, and the DeepSpeed team provides guidance and help as needed. The required deliverables include:
+
+* A PR to [microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) including (1) the feature implementation, (2) unit tests, (3) documentation, and (4) a tutorial.
+* A PR to [microsoft/DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples) or [microsoft/Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) with examples of how to use the feature (these correspond to the experiments planned in the proposal).
+* In the implementation (code, documentation, tutorial), we require the feature author to record their GitHub username as a contact method for future questions/maintenance.
+
+After receiving the PRs, we will review them and merge them after necessary tests/fixes.
+
+### Step 3: release and maintenance
+After the PRs are merged, we will announce the feature on our website (with credit to the feature author). We ask the feature author to commit to the maintenance of the feature.
diff --git a/LICENSE b/LICENSE
index 3d8b93bc7987d14c848448c089e2ae15311380d7..9e841e7a26e4eb057b24511e7b92d42b257a80e5 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,21 +1,21 @@
- MIT License
-
- Copyright (c) Microsoft Corporation.
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE
+ MIT License
+
+ Copyright (c) Microsoft Corporation.
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE
diff --git a/MANIFEST.in b/MANIFEST.in
index 53fcc885090ede17210ec40060c5c8aea8759e98..a918b9286d530f57c67f5d3f80b9245bab541a3e 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,4 +1,4 @@
include *.txt README.md
recursive-include requirements *.txt
-recursive-include deepspeed *.cpp *.h *.cu *.tr *.cuh *.cc
+recursive-include deepspeed *.cpp *.h *.cu *.hip *.tr *.cuh *.cc *.json
recursive-include csrc *.cpp *.h *.cu *.tr *.cuh *.cc
diff --git a/MANIFEST_win.in b/MANIFEST_win.in
new file mode 100644
index 0000000000000000000000000000000000000000..ddfe73e0b4185fc2814d758d15d576113297b684
--- /dev/null
+++ b/MANIFEST_win.in
@@ -0,0 +1,8 @@
+include *.txt README.md
+recursive-include requirements *.txt
+
+# this is for Windows only
+recursive-include deepspeed *.tr
+recursive-exclude deepspeed/ops/csrc *.cpp *.h *.cu *.cuh *.cc
+prune csrc
+prune op_builder
diff --git a/README.md b/README.md
old mode 100755
new mode 100644
index c7bde12dd0ea86d008a387e8b7ca810355fff232..aafbbe5e79b470b12edc7d97e8c1c85ca7caf050
--- a/README.md
+++ b/README.md
@@ -2,9 +2,28 @@
[](https://pypi.org/project/deepspeed/)
[](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
-[](https://hub.docker.com/r/deepspeed/deepspeed)
-### 03/2021: DeepSpeed is hiring! Come join us: [SDE 2](https://careers.microsoft.com/us/en/job/1013160/Software-Engineer-2), [Sr. SDE](https://careers.microsoft.com/us/en/job/1017151/Senior-Software-Engineer), [Sr. Researcher](https://careers.microsoft.com/us/en/job/1016440/Senior-Researcher)
+
+

+

+
+
+
+## Latest News
+* [2022/03/21] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
+* [2022/03/07] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
+* [2022/01/19] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
+ * [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/).
+ * [Mixture of Experts (MoE) Inference tutorial](https://www.deepspeed.ai/tutorials/moe-inference-tutorial).
+* [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
+* [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
+ * Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
+
+### DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai)
+---
[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.
@@ -14,10 +33,10 @@ library that makes distributed training easy, efficient, and effective.
Minimal Code Change
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
-* Extreme scale: Using current generation of GPU clusters with hundreds of devices, 3D parallelism of DeepSpeed can efficiently train deep learning models with trillions of parameters.
+* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, ZeRO-Offload of DeepSpeed can train models with over 10B parameters, 10x bigger than the state of arts, democratizing multi-billion-parameter model training such that many deep learning scientists can explore bigger and better models.
-* Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution comparing with dense transformers.
-* Extremely communication efficient: 3D parallelism improves communication efficiency allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam reduces communication volume by up to 5x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks.
+* Extremely long sequence length: DeepSpeed's sparse attention powers input sequences an order of magnitude longer and delivers up to 6x faster execution compared with dense transformers.
+* Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving convergence efficiency similar to Adam/LAMB, enabling scaling to different types of GPU clusters and networks.
Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
@@ -31,22 +50,6 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)
**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
-
-# News
-* [2021/04/01] [[DeepSpeed on AzureML] Transformers and CIFAR examples are now available on AzureML GitHub](https://github.com/Azure/azureml-examples/tree/main/workflows/train/deepspeed)
-* [2021/03/30] [[PyTorch Lightning Blog] Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed](https://medium.com/pytorch-lightning/accessible-multi-billion-parameter-model-training-with-pytorch-lightning-deepspeed-c9333ac3bb59)
-* [2021/03/16] [1-bit Adam v2: NCCL-based implementation and more](https://www.deepspeed.ai/tutorials/onebit-adam/)
-* [2021/03/08] [ZeRO-3 Offload: Scale your models to trillion parameters without code changes while leveraging both CPUs & GPUs](https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html)
-* [2021/01/19] [[🤗Hugging Face Blog] Fit More and Train Faster With ZeRO via DeepSpeed and FairScale](https://huggingface.co/blog/zero-deepspeed-fairscale)
-* [2020/11/12] [Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies](#installation)
-* [2020/11/10] [Efficient and robust compressed training through progressive layer dropping](https://www.deepspeed.ai/news/2020/10/28/progressive-layer-dropping-news.html)
-* [2020/09/10] [DeepSpeed v0.3: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
- * [Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention-news.html)
- * [Training a trillion parameters with pipeline parallelism](https://www.deepspeed.ai/news/2020/09/08/pipeline-parallelism.html)
- * [Up to 5x less communication and 3.4x faster training through 1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-news.html)
- * [10x bigger model training on a single GPU with ZeRO-Offload](https://www.deepspeed.ai/news/2020/09/08/ZeRO-Offload.html)
-
-
# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
@@ -96,6 +99,12 @@ If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI please see our [advanced
installation instructions](https://www.deepspeed.ai/tutorials/advanced-install/).
+On Windows you can build a wheel with the following steps; currently only inference mode is supported.
+1. Install PyTorch, such as PyTorch 1.8 + CUDA 11.1.
+2. Install the Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools.
+3. Launch a cmd console with Administrator privileges so that the required symlink folders can be created.
+4. Run `python setup.py bdist_wheel` to build the wheel in the `dist` folder (see the sketch below).
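+
+A minimal sketch of steps 1 and 4 from that console; the PyTorch/CUDA versions shown are illustrative assumptions, not pinned requirements:
+
+```bash
+# install a CUDA-enabled PyTorch build first (example version)
+pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+
+# build the DeepSpeed wheel; the result is written to the dist folder
+python setup.py bdist_wheel
+```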
+
# Features
Below we provide a brief feature list, see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
@@ -116,14 +125,14 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
-* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
-* [Sparse attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html)
+* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
+* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
-* [1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html)
+* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
- * Up to 5x communication volume saving
+ * Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
@@ -142,8 +151,12 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
+* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
+ * A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
+ * Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
+ * Complementary to many other DeepSpeed features
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
-
+* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
# Further Reading
@@ -154,14 +167,14 @@ All DeepSpeed documentation can be found on our website: [deepspeed.ai](https://
| Article | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/) | DeepSpeed features |
-| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
+| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config-json/) | Configuring DeepSpeed |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/cifar-10) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/) | Train GPT2 with DeepSpeed and Megatron-LM |
-| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
+| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/) | Faster training with large learning rates |
-| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/1Cycle/) | SOTA learning schedule in DeepSpeed |
+| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/one-cycle/) | SOTA learning schedule in DeepSpeed |
@@ -192,7 +205,12 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840).
-5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888).
+5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
+6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857).
+7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069).
+8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084).
+9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. [arXiv:2202.06009](https://arxiv.org/abs/2202.06009).
+10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. [arXiv:2201.05596](https://arxiv.org/abs/2201.05596).
# Videos
1. DeepSpeed KDD 2020 Tutorial
@@ -206,3 +224,6 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
* Registration is free and all videos are available on-demand.
* [ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html).
3. [DeepSpeed on AzureML](https://youtu.be/yBVXR8G8Bg8)
+4. Community Tutorials
+ * [DeepSpeed: All the tricks to scale to gigantic models](https://www.youtube.com/watch?v=pDGI668pNg0)
+ * [Turing-NLG, DeepSpeed and the ZeRO optimizer](https://www.youtube.com/watch?v=tC01FRB0M7w)
diff --git a/SECURITY.md b/SECURITY.md
index 7ab49eb8296428b7e97282be73aad19117ff34c2..e0dfff56a9569fee0ec4628bb42319f81731b250 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -1,41 +1,41 @@
-
-
-## Security
-
-Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
-
-If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)) of a security vulnerability, please report it to us as described below.
-
-## Reporting Security Issues
-
-**Please do not report security vulnerabilities through public GitHub issues.**
-
-Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
-
-If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
-
-You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
-
-Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
-
- * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
- * Full paths of source file(s) related to the manifestation of the issue
- * The location of the affected source code (tag/branch/commit or direct URL)
- * Any special configuration required to reproduce the issue
- * Step-by-step instructions to reproduce the issue
- * Proof-of-concept or exploit code (if possible)
- * Impact of the issue, including how an attacker might exploit the issue
-
-This information will help us triage your report more quickly.
-
-If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
-
-## Preferred Languages
-
-We prefer all communications to be in English.
-
-## Policy
-
-Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
-
-
+
+
+## Security
+
+Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
+
+If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
+
+## Reporting Security Issues
+
+**Please do not report security vulnerabilities through public GitHub issues.**
+
+Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
+
+If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
+
+You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
+
+Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
+
+ * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
+ * Full paths of source file(s) related to the manifestation of the issue
+ * The location of the affected source code (tag/branch/commit or direct URL)
+ * Any special configuration required to reproduce the issue
+ * Step-by-step instructions to reproduce the issue
+ * Proof-of-concept or exploit code (if possible)
+ * Impact of the issue, including how an attacker might exploit the issue
+
+This information will help us triage your report more quickly.
+
+If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
+
+## Preferred Languages
+
+We prefer all communications to be in English.
+
+## Policy
+
+Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
+
+
diff --git a/azure/attach.sh b/azure/attach.sh
old mode 100755
new mode 100644
diff --git a/azure/azure_ssh.sh b/azure/azure_ssh.sh
old mode 100755
new mode 100644
diff --git a/azure/build_docker_image.sh b/azure/build_docker_image.sh
old mode 100755
new mode 100644
diff --git a/azure/create_vms.sh b/azure/create_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/setup_docker.sh b/azure/setup_docker.sh
old mode 100755
new mode 100644
diff --git a/azure/setup_vms.sh b/azure/setup_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/shutdown_vms.sh b/azure/shutdown_vms.sh
old mode 100755
new mode 100644
diff --git a/azure/start_container.sh b/azure/start_container.sh
old mode 100755
new mode 100644
diff --git a/bin/deepspeed b/bin/deepspeed
deleted file mode 120000
index 6b768564101983015fd56c8d604e439c2374ad06..0000000000000000000000000000000000000000
--- a/bin/deepspeed
+++ /dev/null
@@ -1 +0,0 @@
-ds
\ No newline at end of file
diff --git a/bin/deepspeed b/bin/deepspeed
new file mode 100644
index 0000000000000000000000000000000000000000..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- /dev/null
+++ b/bin/deepspeed
@@ -0,0 +1,6 @@
+#!/usr/bin/env python3
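+# Console entry point that delegates to the DeepSpeed launcher
+# (deepspeed.launcher.runner.main); replaces the previous symlink to `ds`.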
+
+from deepspeed.launcher.runner import main
+
+if __name__ == '__main__':
+ main()
diff --git a/bin/deepspeed.pt b/bin/deepspeed.pt
deleted file mode 120000
index 6b768564101983015fd56c8d604e439c2374ad06..0000000000000000000000000000000000000000
--- a/bin/deepspeed.pt
+++ /dev/null
@@ -1 +0,0 @@
-ds
\ No newline at end of file
diff --git a/bin/deepspeed.pt b/bin/deepspeed.pt
new file mode 100644
index 0000000000000000000000000000000000000000..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- /dev/null
+++ b/bin/deepspeed.pt
@@ -0,0 +1,6 @@
+#!/usr/bin/env python3
+
+from deepspeed.launcher.runner import main
+
+if __name__ == '__main__':
+ main()
diff --git a/bin/ds b/bin/ds
old mode 100755
new mode 100644
index 6bb47da8ce7cc99dee05ada3931989e0bc2dce4a..5ec8820db922fcdb284ff18cbe7f21c3b2e4d38b
--- a/bin/ds
+++ b/bin/ds
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
from deepspeed.launcher.runner import main
diff --git a/bin/ds_elastic b/bin/ds_elastic
old mode 100755
new mode 100644
index f55ebf106e058990e6e39464b2f6ea3cc211cd14..c9987d4565da3cb4c7e32b8342c201ba0165e030
--- a/bin/ds_elastic
+++ b/bin/ds_elastic
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
import argparse
import json
diff --git a/bin/ds_report b/bin/ds_report
old mode 100755
new mode 100644
index c03a95645eae8e110261155e0892a0e78eae1178..e6f7b50a78b2368c93192e6ef25357f546815037
--- a/bin/ds_report
+++ b/bin/ds_report
@@ -1,6 +1,6 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
-from deepspeed.env_report import main
+from deepspeed.env_report import cli_main
if __name__ == '__main__':
- main()
+ cli_main()
diff --git a/bin/ds_ssh b/bin/ds_ssh
old mode 100755
new mode 100644
index c2330e31ee12def026ea6ffbcf15c6aa5a3bd200..d89fc0b44e176c8acf422302308e6a865350ab49
--- a/bin/ds_ssh
+++ b/bin/ds_ssh
@@ -10,11 +10,25 @@ fi
hostfile=/job/hostfile
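+# Parse options: -f points at a hostfile other than the default /job/hostfile;
+# -h prints usage and exits.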
+while getopts "h?f:" opt; do
+ case "$opt" in
+ h|\?)
+ echo "-f : specify a hostfile, defaults to /job/hostfile"
+ exit 0
+ ;;
+ f)
+ hostfile=$OPTARG
+ shift $((OPTIND-1))
+ ;;
+ esac
+done
+
+echo "hostfile=$hostfile"
+
if [ -f $hostfile ]; then
hosts=`cat $hostfile | awk '{print $1}' | paste -sd "," -`
export PDSH_RCMD_TYPE=ssh
pdsh -w ${hosts} $@
else
- echo "Missing hostfile at ${hostfile}, executing command locally"
- $@
+ echo "Missing hostfile at ${hostfile}, unable to proceed"
fi
diff --git a/csrc/adagrad/cpu_adagrad.cpp b/csrc/adagrad/cpu_adagrad.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..4f2a9b69ef966599d1bd6664f79e312c9240671b
--- /dev/null
+++ b/csrc/adagrad/cpu_adagrad.cpp
@@ -0,0 +1,227 @@
+#include "cpu_adagrad.h"
+#include <cuda_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "cublas_v2.h"
+#include "cuda.h"
+#include "curand.h"
+#include "custom_cuda_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
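+// Step_1 handles the scalar tail that the vectorized Step_AVX path (compiled under
+// __AVX256__/__AVX512__) did not cover: it accumulates the squared (optionally
+// L2-regularized) gradient into _exp_avg_sq and applies the Adagrad parameter update
+// with step size -_alpha and denominator sqrt(variance) + _eps. When dev_params is
+// non-null, updated values are staged in _doubled_buffer and copied to the GPU
+// asynchronously via launch_param_update.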
+void Adagrad_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float step_size = -1 * _alpha;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = grads[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0) { grad = param * _weight_decay + grad; }
+
+ variance += grad * grad;
+
+ grad = sqrt(variance);
+ grad += _eps;
+ grad = momentum / grad;
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ // STORE UPDATE TERM TO GRAD'S MEMORY
+ grads[k] = grad * step_size;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adagrad_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
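+// Constructs an Adagrad_Optimizer and registers it in s_optimizers under optimizer_id
+// so that the step/destroy entry points below can look the instance up from Python.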
+int create_adagrad_optimizer(int optimizer_id,
+ float alpha = 1e-2,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool should_log = false)
+{
+    auto opt = std::make_shared<Adagrad_Optimizer>(alpha, eps, weight_decay);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adagrad Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, weight_decay=%f\n", alpha, weight_decay);
+ }
+
+ return 0;
+}
+
+void Adagrad_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adagrad_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr, grads_ptr, exp_avg_sq_ptr, params_c.size(0));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adagrad_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adagrad_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
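+// pybind11 bindings: expose the entry points above to Python under the extension
+// module name supplied at build time via TORCH_EXTENSION_NAME.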
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adagrad_update", &ds_adagrad_step, "DeepSpeed CPU Adagrad update (C++)");
+ m.def("adagrad_update_copy",
+ &ds_adagrad_step_plus_copy,
+ "DeepSpeed CPU Adagrad update and param copy (C++)");
+ m.def("create_adagrad", &create_adagrad_optimizer, "DeepSpeed CPU Adagrad (C++)");
+ m.def("destroy_adagrad", &destroy_adagrad_optimizer, "DeepSpeed CPU Adagrad destroy (C++)");
+}
diff --git a/csrc/adagrad/cpu_adagrad_hip.cpp b/csrc/adagrad/cpu_adagrad_hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..6bbe9a9ee564c9e8f081c083202326ad279eddd1
--- /dev/null
+++ b/csrc/adagrad/cpu_adagrad_hip.cpp
@@ -0,0 +1,228 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "cpu_adagrad_hip.h"
+#include <hip/hip_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "rocblas.h"
+#include "hip/hip_runtime.h"
+#include "hiprand/hiprand.h"
+#include "custom_hip_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adagrad_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float step_size = -1 * _alpha;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { hipStreamSynchronize(_streams[_buf_index]); }
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = grads[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0) { grad = param * _weight_decay + grad; }
+
+ variance += grad * grad;
+
+ grad = sqrt(variance);
+ grad += _eps;
+ grad = momentum / grad;
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ // STORE UPDATE TERM TO GRAD'S MEMORY
+ grads[k] = grad * step_size;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adagrad_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adagrad_optimizer(int optimizer_id,
+ float alpha = 1e-2,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool should_log = false)
+{
+    auto opt = std::make_shared<Adagrad_Optimizer>(alpha, eps, weight_decay);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adagrad Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, weight_decay=%f\n", alpha, weight_decay);
+ }
+
+ return 0;
+}
+
+void Adagrad_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(
+ &rounded_size, _params, grads, _exp_avg_sq, _param_size, dev_params, half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adagrad_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr, grads_ptr, exp_avg_sq_ptr, params_c.size(0));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adagrad_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float epsilon,
+ float weight_decay,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adagrad_Optimizer> opt =
+        std::static_pointer_cast<Adagrad_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step);
+ opt->update_state(lr, epsilon, weight_decay);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adagrad_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adagrad_update", &ds_adagrad_step, "DeepSpeed CPU Adagrad update (C++)");
+ m.def("adagrad_update_copy",
+ &ds_adagrad_step_plus_copy,
+ "DeepSpeed CPU Adagrad update and param copy (C++)");
+ m.def("create_adagrad", &create_adagrad_optimizer, "DeepSpeed CPU Adagrad (C++)");
+ m.def("destroy_adagrad", &destroy_adagrad_optimizer, "DeepSpeed CPU Adagrad destroy (C++)");
+}
diff --git a/csrc/adam/cpu_adam.cpp b/csrc/adam/cpu_adam.cpp
index d425dc3169efa35d08469ee91cd56a7c7feffda7..727eec8182c12cdcc0dcb0df53a38918bdeae4b8 100644
--- a/csrc/adam/cpu_adam.cpp
+++ b/csrc/adam/cpu_adam.cpp
@@ -1,682 +1,292 @@
-#include "cpu_adam.h"
-#include <cuda_runtime_api.h>
-#include <math.h>
-#include <omp.h>
-#include <torch/extension.h>
-#include <iostream>
-#include <memory>
-#include <type_traits>
-#include <unordered_map>
-#include "cublas_v2.h"
-#include "cuda.h"
-#include "curand.h"
-#include "custom_cuda_layers.h"
-
-static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
-
-#define ROUND_DOWN(size, step) ((size) & ~((step)-1))
-
-// C++ interface
-
-void Adam_Optimizer::Step(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
-
- float step_size = -1 * _alpha / _bias_correction1;
- float w_decay = -1 * _alpha * _weight_decay;
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, SIMD_WIDTH);
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += SIMD_WIDTH) {
- AVX_Data grad_4;
- grad_4.data = SIMD_LOAD(grads + i);
-
- AVX_Data momentum_4;
- momentum_4.data = SIMD_LOAD(_exp_avg + i);
- AVX_Data variance_4;
- variance_4.data = SIMD_LOAD(_exp_avg_sq + i);
-
- AVX_Data param_4;
- param_4.data = SIMD_LOAD(_params + i);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4.data = SIMD_FMA(param_4.data, weight_decay4.data, grad_4.data);
- }
- momentum_4.data = SIMD_MUL(momentum_4.data, betta1_4.data);
- momentum_4.data = SIMD_FMA(grad_4.data, betta1_minus1_4.data, momentum_4.data);
-
- variance_4.data = SIMD_MUL(variance_4.data, betta2_4.data);
- grad_4.data = SIMD_MUL(grad_4.data, grad_4.data);
- variance_4.data = SIMD_FMA(grad_4.data, betta2_minus1_4.data, variance_4.data);
-
- grad_4.data = SIMD_SQRT(variance_4.data);
- grad_4.data = SIMD_FMA(grad_4.data, bias2_sqrt.data, eps_4.data);
- grad_4.data = SIMD_DIV(momentum_4.data, grad_4.data);
- if (_weight_decay > 0 && _adamw_mode) {
- param_4.data = SIMD_FMA(param_4.data, weight_decay4.data, param_4.data);
- }
- param_4.data = SIMD_FMA(grad_4.data, step_size_4.data, param_4.data);
-
- SIMD_STORE(_params + i, param_4.data);
-
- if (dev_params) SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4.data);
-
- SIMD_STORE(_exp_avg + i, momentum_4.data);
- SIMD_STORE(_exp_avg_sq + i, variance_4.data);
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-
-#endif
-
- if (_param_size > rounded_size) {
- for (size_t t = rounded_size; t < _param_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > _param_size) copy_size = _param_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t k = t; k < offset; k++) {
- float grad = grads[k];
- float param = _params[k];
- float momentum = _exp_avg[k];
- float variance = _exp_avg_sq[k];
- if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
- momentum = momentum * _betta1;
- momentum = grad * betta1_minus1 + momentum;
-
- variance = variance * _betta2;
- grad = grad * grad;
- variance = grad * betta2_minus1 + variance;
-
- grad = sqrt(variance);
- grad = grad * _bias_correction2 + _eps;
- grad = momentum / grad;
- if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
- param = grad * step_size + param;
- if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
-
- _params[k] = param;
- _exp_avg[k] = momentum;
- _exp_avg_sq[k] = variance;
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
- }
-}
-
-void Adam_Optimizer::Step_4(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- float step_size = -1 * _alpha / _bias_correction1;
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- float w_decay = -1 * _alpha * _weight_decay;
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, (SIMD_WIDTH << 2));
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += (SIMD_WIDTH << 2)) {
- AVX_Data grad_4[4];
- grad_4[0].data = SIMD_LOAD(grads + i);
- grad_4[1].data = SIMD_LOAD(grads + i + SIMD_WIDTH);
- grad_4[2].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 1));
- grad_4[3].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 3);
-
- AVX_Data momentum_4[4];
- momentum_4[0].data = SIMD_LOAD(_exp_avg + i);
- momentum_4[1].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH);
- momentum_4[2].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 1));
- momentum_4[3].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 3);
-
- AVX_Data variance_4[4];
- variance_4[0].data = SIMD_LOAD(_exp_avg_sq + i);
- variance_4[1].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH);
- variance_4[2].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 1));
- variance_4[3].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 3);
-
- AVX_Data param_4[4];
- param_4[0].data = SIMD_LOAD(_params + i);
- param_4[1].data = SIMD_LOAD(_params + i + SIMD_WIDTH);
- param_4[2].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 1));
- param_4[3].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 3);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, grad_4[0].data);
- grad_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, grad_4[1].data);
- grad_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, grad_4[2].data);
- grad_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, grad_4[3].data);
- }
-
- momentum_4[0].data = SIMD_MUL(momentum_4[0].data, betta1_4.data);
- momentum_4[0].data = SIMD_FMA(grad_4[0].data, betta1_minus1_4.data, momentum_4[0].data);
- momentum_4[1].data = SIMD_MUL(momentum_4[1].data, betta1_4.data);
- momentum_4[1].data = SIMD_FMA(grad_4[1].data, betta1_minus1_4.data, momentum_4[1].data);
- momentum_4[2].data = SIMD_MUL(momentum_4[2].data, betta1_4.data);
- momentum_4[2].data = SIMD_FMA(grad_4[2].data, betta1_minus1_4.data, momentum_4[2].data);
- momentum_4[3].data = SIMD_MUL(momentum_4[3].data, betta1_4.data);
- momentum_4[3].data = SIMD_FMA(grad_4[3].data, betta1_minus1_4.data, momentum_4[3].data);
-
- variance_4[0].data = SIMD_MUL(variance_4[0].data, betta2_4.data);
- variance_4[1].data = SIMD_MUL(variance_4[1].data, betta2_4.data);
- variance_4[2].data = SIMD_MUL(variance_4[2].data, betta2_4.data);
- variance_4[3].data = SIMD_MUL(variance_4[3].data, betta2_4.data);
- grad_4[0].data = SIMD_MUL(grad_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_MUL(grad_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_MUL(grad_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_MUL(grad_4[3].data, grad_4[3].data);
- variance_4[0].data = SIMD_FMA(grad_4[0].data, betta2_minus1_4.data, variance_4[0].data);
- variance_4[1].data = SIMD_FMA(grad_4[1].data, betta2_minus1_4.data, variance_4[1].data);
- variance_4[2].data = SIMD_FMA(grad_4[2].data, betta2_minus1_4.data, variance_4[2].data);
- variance_4[3].data = SIMD_FMA(grad_4[3].data, betta2_minus1_4.data, variance_4[3].data);
-
- grad_4[0].data = SIMD_SQRT(variance_4[0].data);
- grad_4[1].data = SIMD_SQRT(variance_4[1].data);
- grad_4[2].data = SIMD_SQRT(variance_4[2].data);
- grad_4[3].data = SIMD_SQRT(variance_4[3].data);
-
- grad_4[0].data = SIMD_FMA(grad_4[0].data, bias2_sqrt.data, eps_4.data);
- grad_4[1].data = SIMD_FMA(grad_4[1].data, bias2_sqrt.data, eps_4.data);
- grad_4[2].data = SIMD_FMA(grad_4[2].data, bias2_sqrt.data, eps_4.data);
- grad_4[3].data = SIMD_FMA(grad_4[3].data, bias2_sqrt.data, eps_4.data);
- grad_4[0].data = SIMD_DIV(momentum_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_DIV(momentum_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_DIV(momentum_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_DIV(momentum_4[3].data, grad_4[3].data);
-
- if (_weight_decay > 0 && _adamw_mode) {
- param_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, param_4[3].data);
- }
-
- param_4[0].data = SIMD_FMA(grad_4[0].data, step_size_4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(grad_4[1].data, step_size_4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(grad_4[2].data, step_size_4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(grad_4[3].data, step_size_4.data, param_4[3].data);
-
- SIMD_STORE(_params + i, param_4[0].data);
- SIMD_STORE(_params + i + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 1), param_4[2].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 3, param_4[3].data);
-
- if (dev_params) {
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4[0].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 1),
- param_4[2].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 3, param_4[3].data);
- }
-
- SIMD_STORE(_exp_avg + i, momentum_4[0].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH, momentum_4[1].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 1), momentum_4[2].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 3, momentum_4[3].data);
-
- SIMD_STORE(_exp_avg_sq + i, variance_4[0].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH, variance_4[1].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 1), variance_4[2].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 3, variance_4[3].data);
- }
-
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-#endif
- if (_param_size > rounded_size)
- Step((_params + rounded_size),
- (grads + rounded_size),
- (_exp_avg + rounded_size),
- (_exp_avg_sq + rounded_size),
- (_param_size - rounded_size),
- (dev_params != nullptr ? (dev_params + rounded_size) : dev_params));
-}
-
-int create_adam_optimizer(int optimizer_id,
- float alpha = 1e-3,
- float betta1 = 0.9,
- float betta2 = 0.999,
- float eps = 1e-8,
- float weight_decay = 0,
- bool adamw_mode = true)
-{
- auto opt =
-        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
-
- s_optimizers[optimizer_id] = opt;
-#if defined(__AVX512__)
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with AVX512 arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#else
-#if defined(__AVX256__)
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with AVX2 arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#else
- std::cout << "Adam Optimizer #" << optimizer_id
- << " is created with scalar arithmetic capability." << std::endl;
- printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
- alpha,
- betta1,
- betta2,
- weight_decay,
- (int)adamw_mode);
-#endif
-#endif
- return 0;
-}
-
-void Adam_Optimizer::Step_8(float* _params,
- float* grads,
- float* _exp_avg,
- float* _exp_avg_sq,
- size_t _param_size,
- __half* dev_params)
-{
- size_t rounded_size = 0;
-
-#if defined(__AVX512__) or defined(__AVX256__)
-
- AVX_Data betta1_4;
- betta1_4.data = SIMD_SET(_betta1);
- AVX_Data betta2_4;
- betta2_4.data = SIMD_SET(_betta2);
-
- float betta1_minus1 = 1 - _betta1;
- float betta2_minus1 = 1 - _betta2;
- AVX_Data betta1_minus1_4;
- betta1_minus1_4.data = SIMD_SET(betta1_minus1);
- AVX_Data betta2_minus1_4;
- betta2_minus1_4.data = SIMD_SET(betta2_minus1);
-
- AVX_Data bias2_sqrt;
- bias2_sqrt.data = SIMD_SET(_bias_correction2);
-
- AVX_Data eps_4;
- eps_4.data = SIMD_SET(_eps);
-
- float step_size = -1 * _alpha / _bias_correction1;
- AVX_Data step_size_4;
- step_size_4.data = SIMD_SET(step_size);
-
- float w_decay = -1 * _alpha * _weight_decay;
- AVX_Data weight_decay4;
- if (_weight_decay > 0)
- weight_decay4.data = (_adamw_mode ? SIMD_SET(w_decay) : SIMD_SET(_weight_decay));
- rounded_size = ROUND_DOWN(_param_size, (SIMD_WIDTH << 3));
-
- for (size_t t = 0; t < rounded_size; t += TILE) {
- size_t copy_size = TILE;
- if ((t + TILE) > rounded_size) copy_size = rounded_size - t;
- size_t offset = copy_size + t;
- if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
-#pragma omp parallel for
- for (size_t i = t; i < offset; i += (SIMD_WIDTH << 3)) {
- AVX_Data grad_4[8];
- grad_4[0].data = SIMD_LOAD(grads + i);
- grad_4[1].data = SIMD_LOAD(grads + i + SIMD_WIDTH);
- grad_4[2].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 1));
- grad_4[3].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 3);
- grad_4[4].data = SIMD_LOAD(grads + i + (SIMD_WIDTH << 2));
- grad_4[5].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 5);
- grad_4[6].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 6);
- grad_4[7].data = SIMD_LOAD(grads + i + SIMD_WIDTH * 7);
-
- AVX_Data momentum_4[8];
- momentum_4[0].data = SIMD_LOAD(_exp_avg + i);
- momentum_4[1].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH);
- momentum_4[2].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 1));
- momentum_4[3].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 3);
- momentum_4[4].data = SIMD_LOAD(_exp_avg + i + (SIMD_WIDTH << 2));
- momentum_4[5].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 5);
- momentum_4[6].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 6);
- momentum_4[7].data = SIMD_LOAD(_exp_avg + i + SIMD_WIDTH * 7);
-
- AVX_Data variance_4[8];
- variance_4[0].data = SIMD_LOAD(_exp_avg_sq + i);
- variance_4[1].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH);
- variance_4[2].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 1));
- variance_4[3].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 3);
- variance_4[4].data = SIMD_LOAD(_exp_avg_sq + i + (SIMD_WIDTH << 2));
- variance_4[5].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 5);
- variance_4[6].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 6);
- variance_4[7].data = SIMD_LOAD(_exp_avg_sq + i + SIMD_WIDTH * 7);
-
- AVX_Data param_4[8];
- param_4[0].data = SIMD_LOAD(_params + i);
- param_4[1].data = SIMD_LOAD(_params + i + SIMD_WIDTH);
- param_4[2].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 1));
- param_4[3].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 3);
- param_4[4].data = SIMD_LOAD(_params + i + (SIMD_WIDTH << 2));
- param_4[5].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 5);
- param_4[6].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 6);
- param_4[7].data = SIMD_LOAD(_params + i + SIMD_WIDTH * 7);
-
- if (_weight_decay > 0 && !_adamw_mode) {
- grad_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, grad_4[0].data);
- grad_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, grad_4[1].data);
- grad_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, grad_4[2].data);
- grad_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, grad_4[3].data);
- grad_4[4].data = SIMD_FMA(param_4[4].data, weight_decay4.data, grad_4[4].data);
- grad_4[5].data = SIMD_FMA(param_4[5].data, weight_decay4.data, grad_4[5].data);
- grad_4[6].data = SIMD_FMA(param_4[6].data, weight_decay4.data, grad_4[6].data);
- grad_4[7].data = SIMD_FMA(param_4[7].data, weight_decay4.data, grad_4[7].data);
- }
-
- momentum_4[0].data = SIMD_MUL(momentum_4[0].data, betta1_4.data);
- momentum_4[0].data = SIMD_FMA(grad_4[0].data, betta1_minus1_4.data, momentum_4[0].data);
- momentum_4[1].data = SIMD_MUL(momentum_4[1].data, betta1_4.data);
- momentum_4[1].data = SIMD_FMA(grad_4[1].data, betta1_minus1_4.data, momentum_4[1].data);
- momentum_4[2].data = SIMD_MUL(momentum_4[2].data, betta1_4.data);
- momentum_4[2].data = SIMD_FMA(grad_4[2].data, betta1_minus1_4.data, momentum_4[2].data);
- momentum_4[3].data = SIMD_MUL(momentum_4[3].data, betta1_4.data);
- momentum_4[3].data = SIMD_FMA(grad_4[3].data, betta1_minus1_4.data, momentum_4[3].data);
- momentum_4[4].data = SIMD_MUL(momentum_4[4].data, betta1_4.data);
- momentum_4[4].data = SIMD_FMA(grad_4[4].data, betta1_minus1_4.data, momentum_4[4].data);
- momentum_4[5].data = SIMD_MUL(momentum_4[5].data, betta1_4.data);
- momentum_4[5].data = SIMD_FMA(grad_4[5].data, betta1_minus1_4.data, momentum_4[5].data);
- momentum_4[6].data = SIMD_MUL(momentum_4[6].data, betta1_4.data);
- momentum_4[6].data = SIMD_FMA(grad_4[6].data, betta1_minus1_4.data, momentum_4[6].data);
- momentum_4[7].data = SIMD_MUL(momentum_4[7].data, betta1_4.data);
- momentum_4[7].data = SIMD_FMA(grad_4[7].data, betta1_minus1_4.data, momentum_4[7].data);
-
- variance_4[0].data = SIMD_MUL(variance_4[0].data, betta2_4.data);
- variance_4[1].data = SIMD_MUL(variance_4[1].data, betta2_4.data);
- variance_4[2].data = SIMD_MUL(variance_4[2].data, betta2_4.data);
- variance_4[3].data = SIMD_MUL(variance_4[3].data, betta2_4.data);
- variance_4[4].data = SIMD_MUL(variance_4[4].data, betta2_4.data);
- variance_4[5].data = SIMD_MUL(variance_4[5].data, betta2_4.data);
- variance_4[6].data = SIMD_MUL(variance_4[6].data, betta2_4.data);
- variance_4[7].data = SIMD_MUL(variance_4[7].data, betta2_4.data);
- grad_4[0].data = SIMD_MUL(grad_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_MUL(grad_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_MUL(grad_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_MUL(grad_4[3].data, grad_4[3].data);
- grad_4[4].data = SIMD_MUL(grad_4[4].data, grad_4[4].data);
- grad_4[5].data = SIMD_MUL(grad_4[5].data, grad_4[5].data);
- grad_4[6].data = SIMD_MUL(grad_4[6].data, grad_4[6].data);
- grad_4[7].data = SIMD_MUL(grad_4[7].data, grad_4[7].data);
- variance_4[0].data = SIMD_FMA(grad_4[0].data, betta2_minus1_4.data, variance_4[0].data);
- variance_4[1].data = SIMD_FMA(grad_4[1].data, betta2_minus1_4.data, variance_4[1].data);
- variance_4[2].data = SIMD_FMA(grad_4[2].data, betta2_minus1_4.data, variance_4[2].data);
- variance_4[3].data = SIMD_FMA(grad_4[3].data, betta2_minus1_4.data, variance_4[3].data);
- variance_4[4].data = SIMD_FMA(grad_4[4].data, betta2_minus1_4.data, variance_4[4].data);
- variance_4[5].data = SIMD_FMA(grad_4[5].data, betta2_minus1_4.data, variance_4[5].data);
- variance_4[6].data = SIMD_FMA(grad_4[6].data, betta2_minus1_4.data, variance_4[6].data);
- variance_4[7].data = SIMD_FMA(grad_4[7].data, betta2_minus1_4.data, variance_4[7].data);
-
- grad_4[0].data = SIMD_SQRT(variance_4[0].data);
- grad_4[1].data = SIMD_SQRT(variance_4[1].data);
- grad_4[2].data = SIMD_SQRT(variance_4[2].data);
- grad_4[3].data = SIMD_SQRT(variance_4[3].data);
- grad_4[4].data = SIMD_SQRT(variance_4[4].data);
- grad_4[5].data = SIMD_SQRT(variance_4[5].data);
- grad_4[6].data = SIMD_SQRT(variance_4[6].data);
- grad_4[7].data = SIMD_SQRT(variance_4[7].data);
-
- grad_4[0].data = SIMD_FMA(grad_4[0].data, bias2_sqrt.data, eps_4.data);
- grad_4[1].data = SIMD_FMA(grad_4[1].data, bias2_sqrt.data, eps_4.data);
- grad_4[2].data = SIMD_FMA(grad_4[2].data, bias2_sqrt.data, eps_4.data);
- grad_4[3].data = SIMD_FMA(grad_4[3].data, bias2_sqrt.data, eps_4.data);
- grad_4[4].data = SIMD_FMA(grad_4[4].data, bias2_sqrt.data, eps_4.data);
- grad_4[5].data = SIMD_FMA(grad_4[5].data, bias2_sqrt.data, eps_4.data);
- grad_4[6].data = SIMD_FMA(grad_4[6].data, bias2_sqrt.data, eps_4.data);
- grad_4[7].data = SIMD_FMA(grad_4[7].data, bias2_sqrt.data, eps_4.data);
- grad_4[0].data = SIMD_DIV(momentum_4[0].data, grad_4[0].data);
- grad_4[1].data = SIMD_DIV(momentum_4[1].data, grad_4[1].data);
- grad_4[2].data = SIMD_DIV(momentum_4[2].data, grad_4[2].data);
- grad_4[3].data = SIMD_DIV(momentum_4[3].data, grad_4[3].data);
- grad_4[4].data = SIMD_DIV(momentum_4[4].data, grad_4[4].data);
- grad_4[5].data = SIMD_DIV(momentum_4[5].data, grad_4[5].data);
- grad_4[6].data = SIMD_DIV(momentum_4[6].data, grad_4[6].data);
- grad_4[7].data = SIMD_DIV(momentum_4[7].data, grad_4[7].data);
-
- if (_weight_decay > 0 && _adamw_mode) {
- param_4[0].data = SIMD_FMA(param_4[0].data, weight_decay4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(param_4[1].data, weight_decay4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(param_4[2].data, weight_decay4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(param_4[3].data, weight_decay4.data, param_4[3].data);
- param_4[4].data = SIMD_FMA(param_4[4].data, weight_decay4.data, param_4[4].data);
- param_4[5].data = SIMD_FMA(param_4[5].data, weight_decay4.data, param_4[5].data);
- param_4[6].data = SIMD_FMA(param_4[6].data, weight_decay4.data, param_4[6].data);
- param_4[7].data = SIMD_FMA(param_4[7].data, weight_decay4.data, param_4[7].data);
- }
-
- param_4[0].data = SIMD_FMA(grad_4[0].data, step_size_4.data, param_4[0].data);
- param_4[1].data = SIMD_FMA(grad_4[1].data, step_size_4.data, param_4[1].data);
- param_4[2].data = SIMD_FMA(grad_4[2].data, step_size_4.data, param_4[2].data);
- param_4[3].data = SIMD_FMA(grad_4[3].data, step_size_4.data, param_4[3].data);
- param_4[4].data = SIMD_FMA(grad_4[4].data, step_size_4.data, param_4[4].data);
- param_4[5].data = SIMD_FMA(grad_4[5].data, step_size_4.data, param_4[5].data);
- param_4[6].data = SIMD_FMA(grad_4[6].data, step_size_4.data, param_4[6].data);
- param_4[7].data = SIMD_FMA(grad_4[7].data, step_size_4.data, param_4[7].data);
-
- SIMD_STORE(_params + i, param_4[0].data);
- SIMD_STORE(_params + i + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 1), param_4[2].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 3, param_4[3].data);
- SIMD_STORE(_params + i + (SIMD_WIDTH << 2), param_4[4].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 5, param_4[5].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 6, param_4[6].data);
- SIMD_STORE(_params + i + SIMD_WIDTH * 7, param_4[7].data);
-
- if (dev_params) {
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t), param_4[0].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH, param_4[1].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 1),
- param_4[2].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 3, param_4[3].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + (SIMD_WIDTH << 2),
- param_4[4].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 5, param_4[5].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 6, param_4[6].data);
- SIMD_STORE(_doubled_buffer[_buf_index] + (i - t) + SIMD_WIDTH * 7, param_4[7].data);
- }
-
- SIMD_STORE(_exp_avg + i, momentum_4[0].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH, momentum_4[1].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 1), momentum_4[2].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 3, momentum_4[3].data);
- SIMD_STORE(_exp_avg + i + (SIMD_WIDTH << 2), momentum_4[4].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 5, momentum_4[5].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 6, momentum_4[6].data);
- SIMD_STORE(_exp_avg + i + SIMD_WIDTH * 7, momentum_4[7].data);
-
- SIMD_STORE(_exp_avg_sq + i, variance_4[0].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH, variance_4[1].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 1), variance_4[2].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 3, variance_4[3].data);
- SIMD_STORE(_exp_avg_sq + i + (SIMD_WIDTH << 2), variance_4[4].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 5, variance_4[5].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 6, variance_4[6].data);
- SIMD_STORE(_exp_avg_sq + i + SIMD_WIDTH * 7, variance_4[7].data);
- }
- if (dev_params) {
- launch_param_update(
- _doubled_buffer[_buf_index], dev_params + t, copy_size, _streams[_buf_index]);
- _buf_index = !_buf_index;
- }
- }
-#endif
- if (_param_size > rounded_size)
- Step_4((_params + rounded_size),
- (grads + rounded_size),
- (_exp_avg + rounded_size),
- (_exp_avg_sq + rounded_size),
- (_param_size - rounded_size),
- (dev_params != nullptr ? (dev_params + rounded_size) : dev_params));
-}
-
-int ds_adam_step(int optimizer_id,
- size_t step,
- float lr,
- float beta1,
- float beta2,
- float epsilon,
- float weight_decay,
- bool bias_correction,
- torch::Tensor& params,
- torch::Tensor& grads,
- torch::Tensor& exp_avg,
- torch::Tensor& exp_avg_sq)
-{
- auto params_c = params.contiguous();
- auto grads_c = grads.contiguous();
- auto exp_avg_c = exp_avg.contiguous();
- auto exp_avg_sq_c = exp_avg_sq.contiguous();
-
- float* params_ptr = (float*)params_c.data_ptr();
- float* grads_ptr = (float*)grads_c.data_ptr();
- float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
- float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
-
-    std::shared_ptr<Adam_Optimizer> opt =
-        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
- opt->IncrementStep(step, beta1, beta2);
- opt->update_state(lr, epsilon, weight_decay, bias_correction);
- opt->Step_8(params_ptr, grads_ptr, exp_avg_ptr, exp_avg_sq_ptr, params_c.size(0));
-
- opt->SynchronizeStreams();
- return 0;
-}
-
-int ds_adam_step_plus_copy(int optimizer_id,
- size_t step,
- float lr,
- float beta1,
- float beta2,
- float epsilon,
- float weight_decay,
- bool bias_correction,
- torch::Tensor& params,
- torch::Tensor& grads,
- torch::Tensor& exp_avg,
- torch::Tensor& exp_avg_sq,
- torch::Tensor& gpu_params)
-{
- auto params_c = params.contiguous();
- auto gpu_params_c = gpu_params.contiguous();
- auto exp_avg_c = exp_avg.contiguous();
- auto exp_avg_sq_c = exp_avg_sq.contiguous();
- auto grads_c = grads.contiguous();
-
- float* params_ptr = (float*)params_c.data_ptr();
- float* grads_ptr = (float*)grads_c.data_ptr();
- __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
- float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
- float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
-
-    std::shared_ptr<Adam_Optimizer> opt =
-        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
- opt->IncrementStep(step, beta1, beta2);
- opt->update_state(lr, epsilon, weight_decay, bias_correction);
- opt->Step_8(
- params_ptr, grads_ptr, exp_avg_ptr, exp_avg_sq_ptr, params_c.size(0), gpu_params_ptr);
-
- opt->SynchronizeStreams();
- return 0;
-}
-
-PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
-{
- m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
- m.def("adam_update_copy",
- &ds_adam_step_plus_copy,
- "DeepSpeed CPU Adam update and param copy (C++)");
- m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
-}
+#include "cpu_adam.h"
+#include <cuda_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "cublas_v2.h"
+#include "cuda.h"
+#include "curand.h"
+#include "custom_cuda_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adam_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
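+    // Anything Step_AVX<1> above did not cover (the tail that is not a multiple of
+    // SIMD_WIDTH, or the whole range when neither AVX512 nor AVX256 is available)
+    // is handled by the scalar loop below.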
+ if (_param_size > rounded_size) {
+ float betta1_minus1 = 1 - _betta1;
+ float betta2_minus1 = 1 - _betta2;
+
+ float step_size = -1 * _alpha / _bias_correction1;
+ float w_decay = -1 * _alpha * _weight_decay;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
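+            // The two host staging buffers are used round-robin; before overwriting one,
+            // wait for the asynchronous device copy launched from it two tiles earlier.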
+ if ((t / TILE) >= 2) { cudaStreamSynchronize(_streams[_buf_index]); }
+
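+            // Per-element Adam/AdamW update: m = b1*m + (1-b1)*g, v = b2*v + (1-b2)*g^2,
+            // p += step_size * m / (sqrt(v) * _bias_correction2 + eps); the bias
+            // corrections are pre-folded into step_size and _bias_correction2.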
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = _exp_avg[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
+ momentum = momentum * _betta1;
+ momentum = grad * betta1_minus1 + momentum;
+
+ variance = variance * _betta2;
+ grad = grad * grad;
+ variance = grad * betta2_minus1 + variance;
+
+ grad = sqrt(variance);
+ grad = grad * _bias_correction2 + _eps;
+ grad = momentum / grad;
+ if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ _exp_avg[k] = momentum;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adam_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adam_optimizer(int optimizer_id,
+ float alpha = 1e-3,
+ float betta1 = 0.9,
+ float betta2 = 0.999,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool adamw_mode = true,
+ bool should_log = false)
+{
+ auto opt =
+        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adam Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
+ alpha,
+ betta1,
+ betta2,
+ weight_decay,
+ (int)adamw_mode);
+ }
+
+ return 0;
+}
+
+void Adam_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
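+    // Widest vector path first: whatever Step_AVX<8> could not cover cascades through
+    // Step_4 and finally the scalar Step_1.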
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adam_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ // assert(params.options().dtype() == grads.options().dtype());
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+
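+    // No device copy in this entry point (dev_params == nullptr); the last argument tells
+    // Step_8 whether params/grads are stored as fp16 and must be reinterpreted.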
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ nullptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adam_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adam_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
+ m.def("adam_update_copy",
+ &ds_adam_step_plus_copy,
+ "DeepSpeed CPU Adam update and param copy (C++)");
+ m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
+ m.def("destroy_adam", &destroy_adam_optimizer, "DeepSpeed CPU Adam destroy (C++)");
+}
diff --git a/csrc/adam/cpu_adam_hip.cpp b/csrc/adam/cpu_adam_hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..67163979fe3311b85e6b3be3d587bdc1c498485f
--- /dev/null
+++ b/csrc/adam/cpu_adam_hip.cpp
@@ -0,0 +1,293 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "cpu_adam_hip.h"
+#include <hip/hip_runtime_api.h>
+#include <math.h>
+#include <omp.h>
+#include <torch/extension.h>
+#include <iostream>
+#include <memory>
+#include <type_traits>
+#include <unordered_map>
+#include "rocblas.h"
+#include "hip/hip_runtime.h"
+#include "hiprand/hiprand.h"
+#include "custom_hip_layers.h"
+
+static std::unordered_map<int, std::shared_ptr<void>> s_optimizers;
+
+// C++ interface
+
+void Adam_Optimizer::Step_1(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<1>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size) {
+ float betta1_minus1 = 1 - _betta1;
+ float betta2_minus1 = 1 - _betta2;
+
+ float step_size = -1 * _alpha / _bias_correction1;
+ float w_decay = -1 * _alpha * _weight_decay;
+ __half* grads_cast_h;
+ __half* params_cast_h;
+ if (half_precision) {
+ grads_cast_h = reinterpret_cast<__half*>(grads);
+ params_cast_h = reinterpret_cast<__half*>(_params);
+ }
+
+ for (size_t t = rounded_size; t < _param_size; t += TILE) {
+ size_t copy_size = TILE;
+ if ((t + TILE) > _param_size) copy_size = _param_size - t;
+ size_t offset = copy_size + t;
+ if ((t / TILE) >= 2) { hipStreamSynchronize(_streams[_buf_index]); }
+
+#pragma omp parallel for
+ for (size_t k = t; k < offset; k++) {
+ float grad = half_precision ? (float)grads_cast_h[k] : grads[k];
+ float param = half_precision ? (float)params_cast_h[k] : _params[k];
+ float momentum = _exp_avg[k];
+ float variance = _exp_avg_sq[k];
+ if (_weight_decay > 0 && !_adamw_mode) { grad = param * _weight_decay + grad; }
+ momentum = momentum * _betta1;
+ momentum = grad * betta1_minus1 + momentum;
+
+ variance = variance * _betta2;
+ grad = grad * grad;
+ variance = grad * betta2_minus1 + variance;
+
+ grad = sqrt(variance);
+ grad = grad * _bias_correction2 + _eps;
+ grad = momentum / grad;
+ if (_weight_decay > 0 && _adamw_mode) { param += w_decay * param; }
+ param = grad * step_size + param;
+ if (dev_params) _doubled_buffer[_buf_index][k - t] = param;
+
+ if (half_precision)
+ params_cast_h[k] = (__half)param;
+ else
+ _params[k] = param;
+ _exp_avg[k] = momentum;
+ _exp_avg_sq[k] = variance;
+ }
+ if (dev_params) {
+ launch_param_update(
+ _doubled_buffer[_buf_index], dev_params + t, (copy_size), _streams[_buf_index]);
+
+ _buf_index = !_buf_index;
+ }
+ }
+ }
+}
+
+void Adam_Optimizer::Step_4(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<4>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_1((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int create_adam_optimizer(int optimizer_id,
+ float alpha = 1e-3,
+ float betta1 = 0.9,
+ float betta2 = 0.999,
+ float eps = 1e-8,
+ float weight_decay = 0,
+ bool adamw_mode = true,
+ bool should_log = false)
+{
+ auto opt =
+        std::make_shared<Adam_Optimizer>(alpha, betta1, betta2, eps, weight_decay, adamw_mode);
+
+ s_optimizers[optimizer_id] = opt;
+
+ if (should_log) {
+ std::string avx_type = "";
+#if defined(__AVX512__)
+ avx_type = "AVX512";
+#else
+#if defined(__AVX256__)
+ avx_type = "AVX2";
+#else
+ avx_type = "scalar";
+#endif
+#endif
+
+ printf("Adam Optimizer #%d is created with %s arithmetic capability.\n",
+ optimizer_id,
+ avx_type.c_str());
+ printf("Config: alpha=%f, betas=(%f, %f), weight_decay=%f, adam_w=%d\n",
+ alpha,
+ betta1,
+ betta2,
+ weight_decay,
+ (int)adamw_mode);
+ }
+
+ return 0;
+}
+
+void Adam_Optimizer::Step_8(float* _params,
+ float* grads,
+ float* _exp_avg,
+ float* _exp_avg_sq,
+ size_t _param_size,
+ __half* dev_params,
+ bool half_precision)
+{
+ size_t rounded_size = 0;
+#if defined(__AVX512__) or defined(__AVX256__)
+ Step_AVX<8>(&rounded_size,
+ _params,
+ grads,
+ _exp_avg,
+ _exp_avg_sq,
+ _param_size,
+ dev_params,
+ half_precision);
+#endif
+ if (_param_size > rounded_size)
+ Step_4((_params + rounded_size),
+ (grads + rounded_size),
+ (_exp_avg + rounded_size),
+ (_exp_avg_sq + rounded_size),
+ (_param_size - rounded_size),
+ (dev_params != nullptr ? (dev_params + rounded_size) : dev_params),
+ half_precision);
+}
+
+int ds_adam_step(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq)
+{
+ auto params_c = params.contiguous();
+ auto grads_c = grads.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+
+ // assert(params.options().dtype() == grads.options().dtype());
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ nullptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int ds_adam_step_plus_copy(int optimizer_id,
+ size_t step,
+ float lr,
+ float beta1,
+ float beta2,
+ float epsilon,
+ float weight_decay,
+ bool bias_correction,
+ torch::Tensor& params,
+ torch::Tensor& grads,
+ torch::Tensor& exp_avg,
+ torch::Tensor& exp_avg_sq,
+ torch::Tensor& gpu_params)
+{
+ auto params_c = params.contiguous();
+ auto gpu_params_c = gpu_params.contiguous();
+ auto exp_avg_c = exp_avg.contiguous();
+ auto exp_avg_sq_c = exp_avg_sq.contiguous();
+ auto grads_c = grads.contiguous();
+
+ float* params_ptr = (float*)params_c.data_ptr();
+ float* grads_ptr = (float*)grads_c.data_ptr();
+ __half* gpu_params_ptr = (__half*)gpu_params_c.data_ptr();
+ float* exp_avg_ptr = (float*)exp_avg_c.data_ptr();
+ float* exp_avg_sq_ptr = (float*)exp_avg_sq_c.data_ptr();
+
+    std::shared_ptr<Adam_Optimizer> opt =
+        std::static_pointer_cast<Adam_Optimizer>(s_optimizers[optimizer_id]);
+ opt->IncrementStep(step, beta1, beta2);
+ opt->update_state(lr, epsilon, weight_decay, bias_correction);
+ opt->Step_8(params_ptr,
+ grads_ptr,
+ exp_avg_ptr,
+ exp_avg_sq_ptr,
+ params_c.size(0),
+ gpu_params_ptr,
+ (params.options().dtype() == at::kHalf));
+
+ opt->SynchronizeStreams();
+ return 0;
+}
+
+int destroy_adam_optimizer(int optimizer_id)
+{
+ s_optimizers.erase(optimizer_id);
+
+ return 0;
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+ m.def("adam_update", &ds_adam_step, "DeepSpeed CPU Adam update (C++)");
+ m.def("adam_update_copy",
+ &ds_adam_step_plus_copy,
+ "DeepSpeed CPU Adam update and param copy (C++)");
+ m.def("create_adam", &create_adam_optimizer, "DeepSpeed CPU Adam (C++)");
+ m.def("destroy_adam", &destroy_adam_optimizer, "DeepSpeed CPU Adam destroy (C++)");
+}
diff --git a/csrc/adam/multi_tensor_adam.hip b/csrc/adam/multi_tensor_adam.hip
new file mode 100644
index 0000000000000000000000000000000000000000..f0b7ced5c29646b793f8fa904768c091fd9d749e
--- /dev/null
+++ b/csrc/adam/multi_tensor_adam.hip
@@ -0,0 +1,164 @@
+// !!! This is a file automatically generated by hipify!!!
+/* Copyright 2020 The Microsoft DeepSpeed Team
+ Copyright NVIDIA/apex
+ This file is adapted from fused adam in NVIDIA/apex, commit a109f85
+*/
+
+#include <ATen/ATen.h>
+#include <ATen/AccumulateType.h>
+#include <ATen/hip/HIPContext.h>
+#include <ATen/hip/Exceptions.h>
+// Another possibility:
+// #include <torch/all.h>
+
+#include <assert.h>
+
+#include "multi_tensor_apply_hip.cuh"
+#include "type_shim_hip.h"
+
+#define BLOCK_SIZE 512
+#define ILP 4
+
+typedef enum {
+ ADAM_MODE_0 = 0, // L2 regularization mode
+ ADAM_MODE_1 = 1 // Decoupled weight decay mode(AdamW)
+} adamMode_t;
+
+using MATH_T = float;
+
+template <typename T>
+struct AdamFunctor {
+ __device__ __forceinline__ void operator()(int chunk_size,
+ volatile int* noop_gmem,
+ TensorListMetadata<4>& tl,
+ const float beta1,
+ const float beta2,
+ const float beta1_correction,
+ const float beta2_correction,
+ const float epsilon,
+ const float lr,
+ adamMode_t mode,
+ const float decay)
+ {
+ // I'd like this kernel to propagate infs/nans.
+ // if(*noop_gmem == 1)
+ // return;
+
+ int tensor_loc = tl.block_to_tensor[blockIdx.x];
+
+ // potentially use to pass in list of scalar
+ // int tensor_num = tl.start_tensor_this_launch + tensor_loc;
+
+ int chunk_idx = tl.block_to_chunk[blockIdx.x];
+ int n = tl.sizes[tensor_loc];
+
+ T* g = (T*)tl.addresses[0][tensor_loc];
+ g += chunk_idx * chunk_size;
+
+ T* p = (T*)tl.addresses[1][tensor_loc];
+ p += chunk_idx * chunk_size;
+
+ T* m = (T*)tl.addresses[2][tensor_loc];
+ m += chunk_idx * chunk_size;
+
+ T* v = (T*)tl.addresses[3][tensor_loc];
+ v += chunk_idx * chunk_size;
+
+ n -= chunk_idx * chunk_size;
+
+ // see note in multi_tensor_scale_kernel.cu
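+        // Each thread stages ILP elements (strided by blockDim.x) into registers,
+        // applies the Adam update, and writes the results back at the end of the tile.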
+ for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) {
+ MATH_T r_g[ILP];
+ MATH_T r_p[ILP];
+ MATH_T r_m[ILP];
+ MATH_T r_v[ILP];
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ int i = i_start + threadIdx.x + ii * blockDim.x;
+ if (i < n && i < chunk_size) {
+ r_g[ii] = g[i];
+ r_p[ii] = p[i];
+ r_m[ii] = m[i];
+ r_v[ii] = v[i];
+ } else {
+ r_g[ii] = MATH_T(0);
+ r_p[ii] = MATH_T(0);
+ r_m[ii] = MATH_T(0);
+ r_v[ii] = MATH_T(0);
+ }
+ }
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ if (mode == ADAM_MODE_0) { // L2
+ r_g[ii] = r_g[ii] + (decay * r_p[ii]);
+ r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
+ r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
+ MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
+ MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
+ MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
+ MATH_T update = next_m_unbiased / denom;
+ r_p[ii] = r_p[ii] - (lr * update);
+ } else { // weight decay
+ r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
+ r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
+ MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
+ MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
+ MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
+ MATH_T update = (next_m_unbiased / denom) + (decay * r_p[ii]);
+ r_p[ii] = r_p[ii] - (lr * update);
+ }
+ }
+#pragma unroll
+ for (int ii = 0; ii < ILP; ii++) {
+ int i = i_start + threadIdx.x + ii * blockDim.x;
+ if (i < n && i < chunk_size) {
+ p[i] = r_p[ii];
+ m[i] = r_m[ii];
+ v[i] = r_v[ii];
+ }
+ }
+ }
+ }
+};
+
+void multi_tensor_adam_cuda(int chunk_size,
+ at::Tensor noop_flag,
+                            std::vector<std::vector<at::Tensor>> tensor_lists,
+ const float lr,
+ const float beta1,
+ const float beta2,
+ const float epsilon,
+ const int step,
+ const int mode,
+ const int bias_correction,
+ const float weight_decay)
+{
+ using namespace at;
+
+ // Handle bias correction mode
+ float bias_correction1 = 1.0f, bias_correction2 = 1.0f;
+ if (bias_correction == 1) {
+ bias_correction1 = 1 - ::pow(beta1, step);
+ bias_correction2 = 1 - ::pow(beta2, step);
+ }
+
+ // Assume single type across p,g,m1,m2 now
+ DISPATCH_DOUBLE_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(),
+ 0,
+ "adam",
+ multi_tensor_apply<4>(BLOCK_SIZE,
+ chunk_size,
+ noop_flag,
+ tensor_lists,
+                                                 AdamFunctor<scalar_t_0>(),
+ beta1,
+ beta2,
+ bias_correction1,
+ bias_correction2,
+ epsilon,
+ lr,
+ (adamMode_t)mode,
+ weight_decay);)
+
+ AT_CUDA_CHECK(hipGetLastError());
+}
diff --git a/csrc/adam/multi_tensor_apply_hip.cuh b/csrc/adam/multi_tensor_apply_hip.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..09bc9971f216f73d7e33a1b75c52d2e975115743
--- /dev/null
+++ b/csrc/adam/multi_tensor_apply_hip.cuh
@@ -0,0 +1,129 @@
+// !!! This is a file automatically generated by hipify!!!
+#include "hip/hip_runtime.h"
+/* Copyright 2020 The Microsoft DeepSpeed Team
+ Copyright NVIDIA/apex
+ This file is adapted from fused adam in NVIDIA/apex, commit a109f85
+*/
+
+#include <ATen/ATen.h>
+#include <ATen/AccumulateType.h>
+#include <ATen/hip/HIPContext.h>
+#include <ATen/hip/Exceptions.h>
+#include <ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.h>
+#include "compat.h"
+
+#include <assert.h>
+
+// #include <iostream>
+
+// This header is the one-stop shop for all your multi-tensor apply needs.
+
+// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
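+// These limits bound how many tensor addresses and block-to-chunk mappings fit in the
+// TensorListMetadata struct that is passed by value as a kernel argument.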
+constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30};
+constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320};
+
+template <int n>
+struct TensorListMetadata {
+ void* addresses[n][depth_to_max_tensors[n - 1]];
+ int sizes[depth_to_max_tensors[n - 1]];
+ unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
+ int block_to_chunk[depth_to_max_blocks[n - 1]]; // I fear this needs to be a full int.
+ int start_tensor_this_launch;
+};
+
+template <typename T, typename U, typename... ArgTypes>
+__global__ void multi_tensor_apply_kernel(int chunk_size,
+ volatile int* noop_flag,
+ T tl,
+ U callable,
+ ArgTypes... args)
+{
+ // Hand the chunk information to the user-supplied functor to process however it likes.
+ callable(chunk_size, noop_flag, tl, args...);
+}
+
+template <int depth, typename T, typename... ArgTypes>
+void multi_tensor_apply(int block_size,
+ int chunk_size,
+ const at::Tensor& noop_flag,
+                        const std::vector<std::vector<at::Tensor>>& tensor_lists,
+ T callable,
+ ArgTypes... args)
+{
+ TORCH_CHECK(tensor_lists.size() == depth, "tensor_lists.size() != depth");
+ int len0 = tensor_lists[0].size();
+ TORCH_CHECK(len0 > 0, "tensor_lists[0].size() is not > 0");
+ auto ref_device = tensor_lists[0][0].device();
+ TORCH_CHECK(ref_device.type() == at::kCUDA, "expected input to be on cuda");
+ for (int l = 0; l < tensor_lists.size(); l++) // No range-based for because I need indices
+ {
+ TORCH_CHECK(tensor_lists[l].size() == len0, "Size mismatch among tensor lists");
+ for (int t = 0; t < tensor_lists[l].size(); t++) {
+ // TODO: Print which tensor fails.
+ bool contiguous_memory = tensor_lists[l][t].is_contiguous();
+#ifdef VERSION_GE_1_5
+ contiguous_memory = (contiguous_memory ||
+ tensor_lists[l][t].is_contiguous(at::MemoryFormat::ChannelsLast));
+#endif
+ TORCH_CHECK(contiguous_memory, "A tensor was not contiguous.");
+ TORCH_CHECK(tensor_lists[l][t].device() == ref_device,
+ "A tensor was not on the same device as the first tensor");
+ TORCH_CHECK(tensor_lists[l][t].numel() == tensor_lists[0][t].numel(), "Size mismatch");
+ }
+ }
+
+ int ntensors = tensor_lists[0].size();
+
+    TensorListMetadata<depth> tl;
+
+ const at::hip::OptionalHIPGuardMasqueradingAsCUDA device_guard(device_of(tensor_lists[0][0]));
+ auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
+
+ tl.start_tensor_this_launch = 0;
+ int loc_block_info = 0;
+ int loc_tensor_info = 0;
+ for (int t = 0; t < ntensors; t++) {
+ tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel();
+ for (int d = 0; d < depth; d++)
+ tl.addresses[d][loc_tensor_info] = tensor_lists[d][t].data_ptr();
+ loc_tensor_info++;
+
+ int chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size;
+
+ for (int chunk = 0; chunk < chunks_this_tensor; chunk++) {
+ // std::cout << chunks_this_tensor << std::endl;
+ tl.block_to_tensor[loc_block_info] = loc_tensor_info - 1;
+ tl.block_to_chunk[loc_block_info] = chunk;
+ loc_block_info++;
+
+ bool tensors_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] &&
+ chunk == chunks_this_tensor - 1);
+ bool blocks_full = (loc_block_info == depth_to_max_blocks[depth - 1]);
+ bool last_chunk = (t == ntensors - 1 && chunk == chunks_this_tensor - 1);
+ if (tensors_full || blocks_full || last_chunk) {
+                // using accscalar_t = acc_type<scalar_t, false>;
+ hipLaunchKernelGGL(( multi_tensor_apply_kernel), dim3(loc_block_info), dim3(block_size), 0, stream,
+                    chunk_size, noop_flag.DATA_PTR<int>(), tl, callable, args...);
+
+ AT_CUDA_CHECK(hipGetLastError());
+
+ // Reset. The control flow possibilities here make my brain hurt.
+ loc_block_info = 0;
+ if (chunk == chunks_this_tensor - 1) {
+ // std::cout << "Hit case 1 " << cond1 << " " << cond2 << " " << cond3 <<
+ // std::endl;
+ loc_tensor_info = 0;
+ tl.start_tensor_this_launch = t + 1;
+ } else {
+ // std::cout << "Hit case 2 " << cond1 << " " << cond2 << " " << cond3 <<
+ // std::endl;
+ tl.sizes[0] = tl.sizes[loc_tensor_info - 1];
+ for (int d = 0; d < depth; d++)
+ tl.addresses[d][0] = tl.addresses[d][loc_tensor_info - 1];
+ loc_tensor_info = 1;
+ tl.start_tensor_this_launch = t;
+ }
+ }
+ }
+ }
+}
diff --git a/csrc/aio/common/deepspeed_aio_common.cpp b/csrc/aio/common/deepspeed_aio_common.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..9e405d8e704c3c260bd8276f648b770471bcd4b7
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_common.cpp
@@ -0,0 +1,333 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <libaio.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <cassert>
+
+#include <algorithm>
+#include <chrono>
+#include <cmath>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <memory>
+#include <numeric>
+#include <string>
+#include <vector>
+
+#include "deepspeed_aio_common.h"
+
+using namespace std;
+using namespace std::chrono;
+
+#define DEBUG_DS_AIO_PERF 0
+#define DEBUG_DS_AIO_SUBMIT_PERF 0
+
+static const std::string c_library_name = "deepspeed_aio";
+
+static void _report_aio_statistics(const char* tag,
+                                   const std::vector<std::chrono::duration<double>>& latencies)
+ __attribute__((unused));
+
+static void _report_aio_statistics(const char* tag,
+                                   const std::vector<std::chrono::duration<double>>& latencies)
+{
+    std::vector<double> lat_usec;
+ for (auto& lat : latencies) { lat_usec.push_back(lat.count() * 1e6); }
+ const auto min_lat = *(std::min_element(lat_usec.begin(), lat_usec.end()));
+ const auto max_lat = *(std::max_element(lat_usec.begin(), lat_usec.end()));
+ const auto avg_lat = std::accumulate(lat_usec.begin(), lat_usec.end(), 0) / lat_usec.size();
+
+ std::cout << c_library_name << ": latency statistics(usec) " << tag
+ << " min/max/avg = " << min_lat << " " << max_lat << " " << avg_lat << std::endl;
+}
+
+static void _get_aio_latencies(std::vector<std::chrono::duration<double>>& raw_latencies,
+ struct deepspeed_aio_latency_t& summary_latencies)
+{
+    std::vector<double> lat_usec;
+ for (auto& lat : raw_latencies) { lat_usec.push_back(lat.count() * 1e6); }
+ summary_latencies._min_usec = *(std::min_element(lat_usec.begin(), lat_usec.end()));
+ summary_latencies._max_usec = *(std::max_element(lat_usec.begin(), lat_usec.end()));
+ summary_latencies._avg_usec =
+ std::accumulate(lat_usec.begin(), lat_usec.end(), 0) / lat_usec.size();
+}
+
+static void _do_io_submit_singles(const long long int n_iocbs,
+ const long long int iocb_index,
+                                  std::unique_ptr<aio_context>& aio_ctxt,
+                                  std::vector<std::chrono::duration<double>>& submit_times)
+{
+ for (auto i = 0; i < n_iocbs; ++i) {
+ const auto st = std::chrono::high_resolution_clock::now();
+ const auto submit_ret = io_submit(aio_ctxt->_io_ctxt, 1, aio_ctxt->_iocbs.data() + i);
+ submit_times.push_back(std::chrono::high_resolution_clock::now() - st);
+#if DEBUG_DS_AIO_SUBMIT_PERF
+ printf("submit(usec) %f io_index=%lld buf=%p len=%lu off=%llu \n",
+ submit_times.back().count() * 1e6,
+ iocb_index,
+ aio_ctxt->_iocbs[i]->u.c.buf,
+ aio_ctxt->_iocbs[i]->u.c.nbytes,
+ aio_ctxt->_iocbs[i]->u.c.offset);
+#endif
+ assert(submit_ret > 0);
+ }
+}
+
+static void _do_io_submit_block(const long long int n_iocbs,
+ const long long int iocb_index,
+                                std::unique_ptr<aio_context>& aio_ctxt,
+                                std::vector<std::chrono::duration<double>>& submit_times)
+{
+ const auto st = std::chrono::high_resolution_clock::now();
+ const auto submit_ret = io_submit(aio_ctxt->_io_ctxt, n_iocbs, aio_ctxt->_iocbs.data());
+ submit_times.push_back(std::chrono::high_resolution_clock::now() - st);
+#if DEBUG_DS_AIO_SUBMIT_PERF
+ printf("submit(usec) %f io_index=%lld nr=%lld buf=%p len=%lu off=%llu \n",
+ submit_times.back().count() * 1e6,
+ iocb_index,
+ n_iocbs,
+ aio_ctxt->_iocbs[0]->u.c.buf,
+ aio_ctxt->_iocbs[0]->u.c.nbytes,
+ aio_ctxt->_iocbs[0]->u.c.offset);
+#endif
+ assert(submit_ret > 0);
+}
+
+static int _do_io_complete(const long long int min_completes,
+ const long long int max_completes,
+                           std::unique_ptr<aio_context>& aio_ctxt,
+                           std::vector<std::chrono::duration<double>>& reap_times)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ const auto n_completes = io_getevents(
+ aio_ctxt->_io_ctxt, min_completes, max_completes, aio_ctxt->_io_events.data(), nullptr);
+ reap_times.push_back(std::chrono::high_resolution_clock::now() - start_time);
+
+ assert(n_completes >= min_completes);
+ return n_completes;
+}
+
+void do_aio_operation_sequential(const bool read_op,
+                                 std::unique_ptr<aio_context>& aio_ctxt,
+                                 std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf)
+{
+ struct io_prep_context prep_ctxt(read_op, xfer_ctxt, aio_ctxt->_block_size, &aio_ctxt->_iocbs);
+
+    const auto num_io_blocks = static_cast<long long int>(
+        ceil(static_cast<double>(xfer_ctxt->_num_bytes) / aio_ctxt->_block_size));
+#if DEBUG_DS_AIO_PERF
+ const auto io_op_name = std::string(read_op ? "read" : "write");
+ std::cout << c_library_name << ": start " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes with " << num_io_blocks << " io blocks" << std::endl;
+#endif
+
+    std::vector<std::chrono::duration<double>> submit_times;
+    std::vector<std::chrono::duration<double>> reap_times;
+    const auto max_queue_bytes =
+        static_cast<long long int>(aio_ctxt->_queue_depth * aio_ctxt->_block_size);
+
+ auto start = std::chrono::high_resolution_clock::now();
+ for (long long iocb_index = 0; iocb_index < num_io_blocks;
+ iocb_index += aio_ctxt->_queue_depth) {
+ const auto start_offset = iocb_index * aio_ctxt->_block_size;
+ const auto start_buffer = (char*)xfer_ctxt->_mem_buffer + start_offset;
+ const auto n_iocbs =
+            min(static_cast<long long int>(aio_ctxt->_queue_depth), (num_io_blocks - iocb_index));
+ const auto num_bytes = min(max_queue_bytes, (xfer_ctxt->_num_bytes - start_offset));
+ prep_ctxt.prep_iocbs(n_iocbs, num_bytes, start_buffer, start_offset);
+
+ if (config->_single_submit) {
+ _do_io_submit_singles(n_iocbs, iocb_index, aio_ctxt, submit_times);
+ } else {
+ _do_io_submit_block(n_iocbs, iocb_index, aio_ctxt, submit_times);
+ }
+
+ _do_io_complete(n_iocbs, n_iocbs, aio_ctxt, reap_times);
+ }
+    const std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;
+
+ if (perf) {
+ _get_aio_latencies(submit_times, perf->_submit);
+ _get_aio_latencies(reap_times, perf->_complete);
+ perf->_e2e_usec = elapsed.count() * 1e6;
+ perf->_e2e_rate_GB = (xfer_ctxt->_num_bytes / elapsed.count() / 1e9);
+ }
+
+#if DEBUG_DS_AIO_PERF
+ _report_aio_statistics("submit", submit_times);
+ _report_aio_statistics("complete", reap_times);
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": runtime(usec) " << elapsed.count() * 1e6
+ << " rate(GB/sec) = " << (xfer_ctxt->_num_bytes / elapsed.count() / 1e9) << std::endl;
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": finish " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes " << std::endl;
+#endif
+}
+
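+// Overlapped strategy: keep the queue topped up by submitting fresh iocbs as completions
+// free slots, reaping at least min_completes events per iteration until nothing is pending.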
+void do_aio_operation_overlap(const bool read_op,
+                              std::unique_ptr<aio_context>& aio_ctxt,
+                              std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf)
+{
+ struct io_prep_generator io_gen(read_op, xfer_ctxt, aio_ctxt->_block_size);
+
+#if DEBUG_DS_AIO_PERF
+ const auto io_op_name = std::string(read_op ? "read" : "write");
+ std::cout << c_library_name << ": start " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes with " << io_gen._num_io_blocks << " io blocks" << std::endl;
+#endif
+
+    std::vector<std::chrono::duration<double>> submit_times;
+    std::vector<std::chrono::duration<double>> reap_times;
+
+ auto request_iocbs = aio_ctxt->_queue_depth;
+ auto n_pending_iocbs = 0;
+ const auto min_completes = 1;
+ auto start = std::chrono::high_resolution_clock::now();
+ while (true) {
+ const auto n_iocbs = io_gen.prep_iocbs(request_iocbs - n_pending_iocbs, &aio_ctxt->_iocbs);
+ if (n_iocbs > 0) {
+ if (config->_single_submit) {
+ _do_io_submit_singles(
+ n_iocbs, (io_gen._next_iocb_index - n_iocbs), aio_ctxt, submit_times);
+ } else {
+ _do_io_submit_block(
+ n_iocbs, (io_gen._next_iocb_index - n_iocbs), aio_ctxt, submit_times);
+ }
+ }
+
+ n_pending_iocbs += n_iocbs;
+ assert(n_pending_iocbs <= aio_ctxt->_queue_depth);
+
+ if (n_pending_iocbs == 0) { break; }
+
+ const auto n_complete =
+ _do_io_complete(min_completes, n_pending_iocbs, aio_ctxt, reap_times);
+ n_pending_iocbs -= n_complete;
+ }
+
+    const std::chrono::duration<double> elapsed = std::chrono::high_resolution_clock::now() - start;
+
+ if (perf) {
+ _get_aio_latencies(submit_times, perf->_submit);
+ _get_aio_latencies(reap_times, perf->_complete);
+ perf->_e2e_usec = elapsed.count() * 1e6;
+ perf->_e2e_rate_GB = (xfer_ctxt->_num_bytes / elapsed.count() / 1e9);
+ }
+
+#if DEBUG_DS_AIO_PERF
+ _report_aio_statistics("submit", submit_times);
+ _report_aio_statistics("complete", reap_times);
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": runtime(usec) " << elapsed.count() * 1e6
+ << " rate(GB/sec) = " << (xfer_ctxt->_num_bytes / elapsed.count() / 1e9) << std::endl;
+#endif
+
+#if DEBUG_DS_AIO_PERF
+ std::cout << c_library_name << ": finish " << io_op_name << " " << xfer_ctxt->_num_bytes
+ << " bytes " << std::endl;
+#endif
+}
+
+void report_file_error(const char* filename, const std::string file_op, const int error_code)
+{
+ std::string err_msg = file_op + std::string(" failed on ") + std::string(filename) +
+ " error = " + std::to_string(error_code);
+ std::cerr << c_library_name << ": " << err_msg << std::endl;
+}
+
+int open_file(const char* filename, const bool read_op)
+{
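+    // O_DIRECT bypasses the page cache, so callers must supply page-aligned buffers
+    // (see ds_page_aligned_alloc).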
+ const int flags = read_op ? (O_RDONLY | __O_DIRECT) : (O_WRONLY | O_CREAT | __O_DIRECT);
+ const int mode = 0600;
+ const auto fd = open(filename, flags, mode);
+ if (fd == -1) {
+ const auto error_code = errno;
+ const auto error_msg = read_op ? " open for read " : " open for write ";
+ report_file_error(filename, error_msg, error_code);
+ return -1;
+ }
+ return fd;
+}
+
+int regular_read(const char* filename, std::vector<char>& buffer)
+{
+ long long int num_bytes;
+ const auto f_size = get_file_size(filename, num_bytes);
+ assert(f_size != -1);
+ buffer.resize(num_bytes);
+ const auto fd = open(filename, O_RDONLY, 0600);
+ assert(fd != -1);
+ long long int read_bytes = 0;
+ auto r = 0;
+ do {
+ const auto buffer_ptr = buffer.data() + read_bytes;
+ const auto bytes_to_read = num_bytes - read_bytes;
+ r = read(fd, buffer_ptr, bytes_to_read);
+ read_bytes += r;
+ } while (r > 0);
+
+ if (read_bytes != num_bytes) {
+ std::cerr << "read error "
+ << " read_bytes (read) = " << read_bytes << " num_bytes (fstat) = " << num_bytes
+ << std::endl;
+ }
+ assert(read_bytes == num_bytes);
+ close(fd);
+ return 0;
+}
+
+static bool _validate_buffer(const char* filename, void* aio_buffer, const long long int num_bytes)
+{
+    std::vector<char> regular_buffer;
+ const auto reg_ret = regular_read(filename, regular_buffer);
+ assert(0 == reg_ret);
+ std::cout << "regular read of " << filename << " returned " << regular_buffer.size() << " bytes"
+ << std::endl;
+
+    if (static_cast<long long int>(regular_buffer.size()) != num_bytes) { return false; }
+
+ return (0 == memcmp(aio_buffer, regular_buffer.data(), regular_buffer.size()));
+}
+
+bool validate_aio_operation(const bool read_op,
+ const char* filename,
+ void* aio_buffer,
+ const long long int num_bytes)
+{
+ const auto msg_suffix = std::string("deepspeed_aio_") +
+ std::string(read_op ? "read()" : "write()") +
+ std::string("using read()");
+
+ if (false == _validate_buffer(filename, aio_buffer, num_bytes)) {
+ std::cout << "Fail: correctness of " << msg_suffix << std::endl;
+ return false;
+ }
+
+ std::cout << "Pass: correctness of " << msg_suffix << std::endl;
+ return true;
+}
diff --git a/csrc/aio/common/deepspeed_aio_common.h b/csrc/aio/common/deepspeed_aio_common.h
new file mode 100644
index 0000000000000000000000000000000000000000..cc62d33765c804e88816791c72a3477278738e76
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_common.h
@@ -0,0 +1,36 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <deepspeed_aio_utils.h>
+#include <memory>
+#include <string>
+#include <vector>
+
+using namespace std;
+
+void do_aio_operation_sequential(const bool read_op,
+                                 std::unique_ptr<aio_context>& aio_ctxt,
+                                 std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf);
+
+void do_aio_operation_overlap(const bool read_op,
+                              std::unique_ptr<aio_context>& aio_ctxt,
+                              std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ deepspeed_aio_config_t* config,
+ deepspeed_aio_perf_t* perf);
+
+int open_file(const char* filename, const bool read_op);
+
+void report_file_error(const char* filename, const std::string file_op, const int error_code);
+
+int regular_read(const char* filename, std::vector<char>& buffer);
+
+bool validate_aio_operation(const bool read_op,
+ const char* filename,
+ void* aio_buffer,
+ const long long int num_bytes);
diff --git a/csrc/aio/common/deepspeed_aio_types.cpp b/csrc/aio/common/deepspeed_aio_types.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..e5811bb91149fad40422692ac7cde6f9348e0029
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_types.cpp
@@ -0,0 +1,74 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include
+
+#include "deepspeed_aio_utils.h"
+
+using namespace std;
+
+const int c_block_size = 128 * 1024;
+const int c_io_queue_depth = 8;
+
+deepspeed_aio_config_t::deepspeed_aio_config_t()
+ : _block_size(c_block_size),
+ _queue_depth(c_io_queue_depth),
+ _single_submit(false),
+ _overlap_events(false),
+ _lock_memory(false)
+{
+}
+
+deepspeed_aio_config_t::deepspeed_aio_config_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool lock_memory)
+ : _block_size(block_size),
+ _queue_depth(queue_depth),
+ _single_submit(single_submit),
+ _overlap_events(overlap_events),
+ _lock_memory(lock_memory)
+{
+}
+
+void deepspeed_aio_latency_t::dump(const std::string tag)
+{
+ std::cout << tag << _min_usec << " " << _max_usec << " " << _avg_usec << " " << std::endl;
+}
+
+void deepspeed_aio_latency_t::accumulate(const struct deepspeed_aio_latency_t& other)
+{
+ _min_usec += other._min_usec;
+ _max_usec += other._max_usec;
+ _avg_usec += other._avg_usec;
+}
+
+void deepspeed_aio_latency_t::scale(const float scaler)
+{
+ _min_usec *= scaler;
+ _max_usec *= scaler;
+ _avg_usec *= scaler;
+}
+
+aio_context::aio_context(const int block_size, const int queue_depth)
+{
+ _block_size = block_size;
+ _queue_depth = queue_depth;
+ for (auto i = 0; i < queue_depth; ++i) {
+ _iocbs.push_back((struct iocb*)calloc(1, sizeof(struct iocb)));
+ }
+ _io_events.resize(queue_depth);
+ io_queue_init(queue_depth, &_io_ctxt);
+}
+
+aio_context::~aio_context()
+{
+ for (auto& iocb : _iocbs) { free(iocb); }
+ _io_events.resize(0);
+ io_queue_release(_io_ctxt);
+}
diff --git a/csrc/aio/common/deepspeed_aio_types.h b/csrc/aio/common/deepspeed_aio_types.h
new file mode 100644
index 0000000000000000000000000000000000000000..be3b352d6be20733f7e03a821378a648384be0b5
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_types.h
@@ -0,0 +1,57 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <libaio.h>
+#include <stdlib.h>
+
+#include <string>
+#include <vector>
+
+using namespace std;
+
+struct deepspeed_aio_latency_t {
+ double _min_usec;
+ double _max_usec;
+ double _avg_usec;
+
+ void dump(const std::string tag);
+ void accumulate(const deepspeed_aio_latency_t&);
+ void scale(const float value);
+};
+
+struct deepspeed_aio_perf_t {
+ deepspeed_aio_latency_t _submit;
+ deepspeed_aio_latency_t _complete;
+ double _e2e_usec;
+ double _e2e_rate_GB;
+};
+
+struct deepspeed_aio_config_t {
+ const int _block_size;
+ const int _queue_depth;
+ const bool _single_submit;
+ const bool _overlap_events;
+ const bool _lock_memory;
+
+ deepspeed_aio_config_t();
+ deepspeed_aio_config_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool lock_memory);
+};
+
+struct aio_context {
+ io_context_t _io_ctxt;
+    std::vector<struct io_event> _io_events;
+    std::vector<struct iocb*> _iocbs;
+ int _block_size;
+ int _queue_depth;
+
+ aio_context(const int block_size, const int queue_depth);
+ ~aio_context();
+};
diff --git a/csrc/aio/common/deepspeed_aio_utils.cpp b/csrc/aio/common/deepspeed_aio_utils.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..200c7030f120366c2e2a45cb6cc20785ec4518fd
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_utils.cpp
@@ -0,0 +1,123 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <cmath>
+
+#include "deepspeed_aio_utils.h"
+
+using namespace std;
+
+const int c_block_size = 128 * 1024;
+const int c_io_queue_depth = 8;
+
+io_xfer_ctxt::io_xfer_ctxt(const int fd,
+ const long long int file_offset,
+ const long long int num_bytes,
+ const void* buffer)
+ : _fd(fd), _base_offset(file_offset), _mem_buffer(buffer), _num_bytes(num_bytes)
+{
+}
+
+io_prep_context::io_prep_context(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size,
+ const std::vector<struct iocb*>* iocbs)
+ : _read_op(read_op), _xfer_ctxt(xfer_ctxt), _block_size(block_size), _iocbs(iocbs)
+{
+}
+
+void io_prep_context::prep_iocbs(const int n_iocbs,
+ const size_t num_bytes,
+ const void* start_buffer,
+ const long long int start_offset)
+{
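+ // Split the transfer into _block_size chunks: iocb i covers block i, and the final block is trimmed to the remaining bytes.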
+ assert(static_cast<size_t>(n_iocbs) <= _iocbs->size());
+ for (auto i = 0; i < n_iocbs; ++i) {
+ const auto shift = i * _block_size;
+ const auto xfer_buffer = (char*)start_buffer + _xfer_ctxt->_base_offset + shift;
+ const auto xfer_offset = _xfer_ctxt->_base_offset + start_offset + shift;
+ auto byte_count = _block_size;
+ if ((shift + _block_size) > num_bytes) { byte_count = num_bytes - shift; }
+
+ if (_read_op) {
+ io_prep_pread(_iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, byte_count, xfer_offset);
+ } else {
+ io_prep_pwrite(_iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, byte_count, xfer_offset);
+ }
+ }
+}
+
+io_prep_generator::io_prep_generator(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size)
+ : _read_op(read_op),
+ _xfer_ctxt(xfer_ctxt),
+ _block_size(block_size),
+ _remaining_bytes(xfer_ctxt->_num_bytes),
+ _next_iocb_index(0)
+{
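+ // Total block count is the ceiling of num_bytes / block_size; prep_iocbs() then hands out up to n_iocbs blocks per call.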
+ _num_io_blocks =
+ static_cast<long long int>(ceil(static_cast<double>(xfer_ctxt->_num_bytes) / block_size));
+ _remaining_io_blocks = _num_io_blocks;
+}
+
+int io_prep_generator::prep_iocbs(const int n_iocbs, std::vector<struct iocb*>* iocbs)
+{
+ if ((_remaining_bytes) == 0 || (_remaining_io_blocks == 0)) {
+ assert(static_cast<long long int>(_remaining_bytes) == _remaining_io_blocks);
+ return 0;
+ }
+
+ assert(static_cast<size_t>(n_iocbs) <= iocbs->size());
+
+ auto actual_n_iocbs = min(static_cast<long long int>(n_iocbs), _remaining_io_blocks);
+ for (auto i = 0; i < actual_n_iocbs; ++i, ++_next_iocb_index) {
+ const auto xfer_offset = _xfer_ctxt->_base_offset + (_next_iocb_index * _block_size);
+ const auto xfer_buffer = (char*)_xfer_ctxt->_mem_buffer + xfer_offset;
+ const auto num_bytes = min(static_cast<long long int>(_block_size), _remaining_bytes);
+
+ if (_read_op) {
+ io_prep_pread(iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, num_bytes, xfer_offset);
+ } else {
+ io_prep_pwrite(iocbs->at(i), _xfer_ctxt->_fd, xfer_buffer, num_bytes, xfer_offset);
+ }
+ _remaining_bytes -= num_bytes;
+ }
+ _remaining_io_blocks -= actual_n_iocbs;
+
+ return actual_n_iocbs;
+}
+
+int get_file_size(const char* filename, long long int& size)
+{
+ struct stat st;
+ if (stat(filename, &st) == -1) { return -1; }
+ size = st.st_size;
+ return 0;
+}
+
+void* ds_page_aligned_alloc(const size_t size, const bool lock)
+{
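+ // Page-aligned allocation for AIO transfers; optionally mlock() the pages so they stay resident in memory.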
+ void* ptr;
+ int retval;
+
+ retval = posix_memalign(&ptr, (size_t)sysconf(_SC_PAGESIZE), size);
+ if (retval) { return nullptr; }
+
+ if (lock == false) { return ptr; }
+
+ auto mlock_ret = mlock(ptr, size);
+ if (mlock_ret != 0) {
+ auto mlock_error = errno;
+ printf("mlock failed with %d %s\n", mlock_error, strerror(mlock_error));
+
+ free(ptr);
+ return nullptr;
+ }
+
+ return ptr;
+}
diff --git a/csrc/aio/common/deepspeed_aio_utils.h b/csrc/aio/common/deepspeed_aio_utils.h
new file mode 100644
index 0000000000000000000000000000000000000000..6c5952749dd33d5e0059c209dc14ea755424da23
--- /dev/null
+++ b/csrc/aio/common/deepspeed_aio_utils.h
@@ -0,0 +1,77 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#pragma once
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <libaio.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <algorithm>
+#include <cerrno>
+#include <cstdio>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <vector>
+
+struct io_xfer_ctxt {
+ const int _fd;
+ const long long int _base_offset;
+ const void* _mem_buffer;
+ const long long int _num_bytes;
+
+ io_xfer_ctxt(const int fd,
+ const long long int file_offset,
+ const long long int num_bytes,
+ const void* buffer);
+};
+
+struct io_prep_context {
+ const bool _read_op;
+ const std::unique_ptr<io_xfer_ctxt>& _xfer_ctxt;
+ const size_t _block_size;
+ const std::vector<struct iocb*>* _iocbs;
+
+ io_prep_context(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size,
+ const std::vector<struct iocb*>* iocbs);
+
+ void prep_iocbs(const int n_iocbs,
+ const size_t num_bytes,
+ const void* start_buffer,
+ const long long int start_offset);
+};
+
+struct io_prep_generator {
+ const bool _read_op;
+ const std::unique_ptr<io_xfer_ctxt>& _xfer_ctxt;
+ const size_t _block_size;
+
+ long long int _remaining_bytes;
+ long long int _num_io_blocks;
+ long long int _remaining_io_blocks;
+ long long int _next_iocb_index;
+
+ io_prep_generator(const bool read_op,
+ const std::unique_ptr<io_xfer_ctxt>& xfer_ctxt,
+ const size_t block_size);
+
+ int prep_iocbs(const int n_iocbs, std::vector<struct iocb*>* iocbs);
+};
+
+void* ds_page_aligned_alloc(const size_t size, const bool lock = false);
+
+int get_file_size(const char* filename, long long int& size);
diff --git a/csrc/aio/py_lib/deepspeed_aio_thread.cpp b/csrc/aio/py_lib/deepspeed_aio_thread.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..a2670fb7b4cbc6635c25d71e9b5f9bfd265f59c9
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_aio_thread.cpp
@@ -0,0 +1,84 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include "deepspeed_aio_thread.h"
+
+using namespace std;
+
+io_op_desc_t::io_op_desc_t(const bool read_op,
+ const torch::Tensor& buffer,
+ const int fd,
+ const char* filename,
+ const long long int num_bytes,
+ const bool validate)
+ : _read_op(read_op),
+ _buffer(buffer),
+ _fd(fd),
+ _filename(filename),
+ _num_bytes(num_bytes),
+ _validate(validate)
+{
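+ // CUDA tensors are staged through a pinned CPU copy so the AIO path always operates on host memory; fini() copies read data back to the device.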
+ _cpu_buffer = _buffer.is_cuda() ? _buffer.to(torch::kCPU).pin_memory() : _buffer;
+ _contiguous_buffer = _cpu_buffer.contiguous();
+}
+
+char* io_op_desc_t::data_ptr() const { return (char*)_contiguous_buffer.data_ptr(); }
+
+void io_op_desc_t::fini()
+{
+ if (_read_op && _buffer.is_cuda()) { _buffer.copy_(_cpu_buffer.to(torch::kCUDA)); }
+}
+
+deepspeed_aio_thread_t::deepspeed_aio_thread_t(const int tid, deepspeed_aio_config_t& aio_config)
+ : _tid(tid),
+ _aio_config(aio_config),
+ _aio_ctxt(new aio_context(aio_config._block_size, aio_config._queue_depth)),
+ _time_to_exit(false)
+{
+}
+
+deepspeed_aio_thread_t::~deepspeed_aio_thread_t() {}
+
+void deepspeed_aio_thread_t::run()
+{
+ while (true) {
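+ // Worker loop: sleep until work is queued (or exit is requested), service one op, then publish its completion.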
+ std::shared_ptr<struct io_op_desc_t> next_io_op = nullptr;
+
+ {
+ std::unique_lock<std::mutex> lock(_work_sync._mutex);
+ _work_sync._cond_var.wait(lock,
+ [this] { return (!_work_queue.empty() || _time_to_exit); });
+ if (!_work_queue.empty()) {
+ next_io_op = _work_queue.front();
+ _work_queue.pop();
+ }
+ }
+
+ if (next_io_op) {
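+ // Each worker transfers its own contiguous slice of the file; _num_bytes is the per-thread byte count, so the slice starts at tid * _num_bytes.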
+ const auto base_offset = next_io_op->_num_bytes * _tid;
+
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(
+ next_io_op->_fd, base_offset, next_io_op->_num_bytes, next_io_op->data_ptr()));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(
+ next_io_op->_read_op, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(
+ next_io_op->_read_op, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+
+ {
+ std::lock_guard<std::mutex> lock(_complete_sync._mutex);
+ _complete_queue.push(next_io_op);
+ }
+ _complete_sync._cond_var.notify_one();
+ }
+
+ if (_time_to_exit) { break; }
+ }
+}
diff --git a/csrc/aio/py_lib/deepspeed_aio_thread.h b/csrc/aio/py_lib/deepspeed_aio_thread.h
new file mode 100644
index 0000000000000000000000000000000000000000..d1cfcab8bfc2446921422b83efa100444ce0dd31
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_aio_thread.h
@@ -0,0 +1,57 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <condition_variable>
+#include <memory>
+#include <queue>
+#include "deepspeed_py_aio.h"
+
+struct io_op_desc_t {
+ const bool _read_op;
+ torch::Tensor _buffer;
+ int _fd;
+ const std::string _filename;
+ const long long int _num_bytes;
+ torch::Tensor _cpu_buffer;
+ torch::Tensor _contiguous_buffer;
+ const bool _validate;
+
+ io_op_desc_t(const bool read_op,
+ const torch::Tensor& buffer,
+ const int fd,
+ const char* filename,
+ const long long int num_bytes,
+ const bool validate);
+
+ char* data_ptr() const;
+ void fini();
+};
+
+struct thread_sync_t {
+ std::mutex _mutex;
+ std::condition_variable _cond_var;
+};
+
+struct deepspeed_aio_thread_t {
+ const int _tid;
+ deepspeed_aio_config_t& _aio_config;
+
+ std::unique_ptr<struct aio_context> _aio_ctxt;
+ std::queue<std::shared_ptr<struct io_op_desc_t>> _work_queue;
+ std::queue<std::shared_ptr<struct io_op_desc_t>> _complete_queue;
+
+ bool _time_to_exit;
+
+ struct thread_sync_t _work_sync;
+ struct thread_sync_t _complete_sync;
+
+ deepspeed_aio_thread_t(const int tid, deepspeed_aio_config_t& aio_config);
+
+ ~deepspeed_aio_thread_t();
+
+ void run();
+};
diff --git a/csrc/aio/py_lib/deepspeed_py_aio.cpp b/csrc/aio/py_lib/deepspeed_py_aio.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..49ff1f240c433288a0e12c64389887c65926ad83
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio.cpp
@@ -0,0 +1,121 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <cassert>
+#include <cerrno>
+#include <chrono>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "deepspeed_py_aio.h"
+
+using namespace std;
+using namespace std::chrono;
+
+#define DEBUG_DS_AIO_READ 0
+#define DEBUG_DS_AIO_WRITE 0
+
+static const std::string c_library_name = "deepspeed_aio";
+
+int deepspeed_py_aio_write(const torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ deepspeed_aio_config_t config(block_size, queue_depth, single_submit, overlap_events, false);
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto write_buffer = (char*)buffer.data_ptr();
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_write_bytes, write_buffer));
+ std::unique_ptr<aio_context> aio_ctxt(new aio_context(config._block_size, config._queue_depth));
+
+ if (config._overlap_events) {
+ do_aio_operation_overlap(false, aio_ctxt, xfer_ctxt, &config, nullptr);
+ } else {
+ do_aio_operation_sequential(false, aio_ctxt, xfer_ctxt, &config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(false, filename, write_buffer, num_write_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+int deepspeed_py_aio_read(torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+
+ deepspeed_aio_config_t config(block_size, queue_depth, single_submit, overlap_events, false);
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
+ auto read_buffer = (char*)buffer.data_ptr();
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_file_bytes, read_buffer));
+ std::unique_ptr<aio_context> aio_ctxt(new aio_context(config._block_size, config._queue_depth));
+
+ if (config._overlap_events) {
+ do_aio_operation_overlap(true, aio_ctxt, xfer_ctxt, &config, nullptr);
+ } else {
+ do_aio_operation_sequential(true, aio_ctxt, xfer_ctxt, &config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(true, filename, read_buffer, num_file_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
diff --git a/csrc/aio/py_lib/deepspeed_py_aio.h b/csrc/aio/py_lib/deepspeed_py_aio.h
new file mode 100644
index 0000000000000000000000000000000000000000..230d88da9763a0130554ca83c5e3b1a5d914116f
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio.h
@@ -0,0 +1,27 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <deepspeed_aio_common.h>
+#include <stdlib.h>
+#include <torch/extension.h>
+
+int deepspeed_py_aio_write(const torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate);
+
+int deepspeed_py_aio_read(torch::Tensor& buffer,
+ const char* filename,
+ const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const bool validate);
diff --git a/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp b/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..417319f8ae5ce3bead644c80c094d9df1061879a
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp
@@ -0,0 +1,282 @@
+
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include "deepspeed_py_aio_handle.h"
+
+using namespace std;
+
+static void _start_aio_thread(std::shared_ptr<struct deepspeed_aio_thread_t> ctxt) { ctxt->run(); }
+
+deepspeed_aio_handle_t::deepspeed_aio_handle_t(const int block_size,
+ const int queue_depth,
+ const bool single_submit,
+ const bool overlap_events,
+ const int num_threads)
+ : _aio_ctxt(new aio_context(block_size, queue_depth)),
+ _single_submit(single_submit),
+ _overlap_events(overlap_events),
+ _num_threads(num_threads),
+ _aio_config(block_size, queue_depth, single_submit, overlap_events, false),
+ _num_pending_ops(0)
+{
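+ // Create one worker context per thread (each owns its own aio_context) and launch the worker threads.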
+ for (auto i = 0; i < num_threads; ++i) {
+ _thread_contexts.push_back(std::make_shared<struct deepspeed_aio_thread_t>(i, _aio_config));
+ }
+
+ for (auto& ctxt : _thread_contexts) {
+ _threads.push_back(std::thread(_start_aio_thread, ctxt));
+ }
+}
+
+deepspeed_aio_handle_t::~deepspeed_aio_handle_t()
+{
+ _stop_threads();
+ for (auto& thr : _threads) { thr.join(); }
+}
+
+const int deepspeed_aio_handle_t::get_block_size() const
+{
+ return _aio_ctxt ? _aio_ctxt->_block_size : -1;
+}
+
+const int deepspeed_aio_handle_t::get_queue_depth() const
+{
+ return _aio_ctxt ? _aio_ctxt->_queue_depth : -1;
+}
+
+const bool deepspeed_aio_handle_t::get_single_submit() const { return _single_submit; }
+
+const bool deepspeed_aio_handle_t::get_overlap_events() const { return _overlap_events; }
+
+const int deepspeed_aio_handle_t::get_thread_count() const { return _num_threads; }
+
+int deepspeed_aio_handle_t::read(torch::Tensor& buffer, const char* filename, const bool validate)
+{
+ const auto start_time = std::chrono::high_resolution_clock::now();
+
+ assert(_aio_ctxt);
+
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
+ auto read_buffer = (char*)buffer.data_ptr();
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_file_bytes, read_buffer));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(true, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(true, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+
+ close(fd);
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ if (validate) { validate_aio_operation(true, filename, read_buffer, num_file_bytes); }
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+int deepspeed_aio_handle_t::write(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate)
+{
+ assert(_aio_ctxt);
+
+ const auto start_time = std::chrono::high_resolution_clock::now();
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto write_buffer = (char*)buffer.data_ptr();
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ std::unique_ptr<io_xfer_ctxt> xfer_ctxt(new io_xfer_ctxt(fd, 0, num_write_bytes, write_buffer));
+
+ if (_aio_config._overlap_events) {
+ do_aio_operation_overlap(false, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ } else {
+ do_aio_operation_sequential(false, _aio_ctxt, xfer_ctxt, &_aio_config, nullptr);
+ }
+ const std::chrono::duration<double> aio_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+
+ close(fd);
+
+ if (validate) { validate_aio_operation(false, filename, write_buffer, num_write_bytes); }
+
+ const std::chrono::duration<double> fn_time =
+ std::chrono::high_resolution_clock::now() - start_time;
+ std::cout << "Elapsed time(usec): "
+ << "aio = " << aio_time.count() * 1e6 << " call = " << fn_time.count() * 1e6
+ << std::endl;
+ return 0;
+}
+
+void deepspeed_aio_handle_t::_schedule_aio_work(std::shared_ptr<struct io_op_desc_t> scheduled_op)
+{
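+ // Push the same op descriptor to every worker; each worker uses its thread id to pick the slice it is responsible for.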
+ for (auto& ctxt : _thread_contexts) {
+ {
+ std::lock_guard<std::mutex> lock(ctxt->_work_sync._mutex);
+ ctxt->_work_queue.push(scheduled_op);
+ }
+ ctxt->_work_sync._cond_var.notify_one();
+ }
+ _num_pending_ops++;
+}
+
+std::shared_ptr<struct io_op_desc_t> deepspeed_aio_handle_t::_wait_for_aio_work()
+{
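+ // Block until every worker has reported completion; the per-thread queues all hold the same shared descriptor for the current op.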
+ std::shared_ptr<struct io_op_desc_t> completed_op = nullptr;
+ for (auto& ctxt : _thread_contexts) {
+ std::unique_lock<std::mutex> lock(ctxt->_complete_sync._mutex);
+ ctxt->_complete_sync._cond_var.wait(lock,
+ [ctxt] { return !ctxt->_complete_queue.empty(); });
+ completed_op = ctxt->_complete_queue.front();
+ ctxt->_complete_queue.pop();
+ }
+ return completed_op;
+}
+
+void deepspeed_aio_handle_t::_stop_threads()
+{
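+ // Signal all workers to exit their run() loops; callers must have drained pending ops first (asserted below).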
+ assert(0 == _num_pending_ops);
+ for (auto& ctxt : _thread_contexts) {
+ {
+ std::lock_guard<std::mutex> lock(ctxt->_work_sync._mutex);
+ ctxt->_time_to_exit = true;
+ }
+ ctxt->_work_sync._cond_var.notify_one();
+ }
+}
+
+int deepspeed_aio_handle_t::wait()
+{
+ assert(_num_pending_ops > 0);
+ auto num_completed_ops = 0;
+
+ while (_num_pending_ops > 0) {
+ auto completed_op = _wait_for_aio_work();
+
+ completed_op->fini();
+
+ close(completed_op->_fd);
+
+ if (completed_op->_validate) {
+ validate_aio_operation(completed_op->_read_op,
+ completed_op->_filename.c_str(),
+ completed_op->data_ptr(),
+ _num_threads * completed_op->_num_bytes);
+ }
+ --_num_pending_ops;
+ ++num_completed_ops;
+ }
+
+ return num_completed_ops;
+}
+
+bool deepspeed_aio_handle_t::_is_valid_parallel_aio_op(const bool read_op,
+ const long long int num_bytes)
+{
+ const auto op_string = read_op ? "Read" : "Write";
+ if (num_bytes % get_thread_count()) {
+ std::cout << "deepspeed_aio failure: parallel " << op_string << " num_bytes = " << num_bytes
+ << " not divisible by thread count = " << get_thread_count() << std::endl;
+ return false;
+ }
+
+ return true;
+}
+
+int deepspeed_aio_handle_t::pread(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate,
+ const bool async)
+{
+ long long num_file_bytes;
+ if (-1 == get_file_size(filename, num_file_bytes)) {
+ const auto error_code = errno;
+ report_file_error(filename, " fstat for read", error_code);
+ return -1;
+ }
+ const auto buffer_bytes = static_cast<long long int>(buffer.nbytes());
+ if (buffer_bytes != num_file_bytes) {
+ std::cout << filename << ": buffer nbytes != file bytes " << buffer_bytes
+ << " != " << num_file_bytes << std::endl;
+ }
+ assert(static_cast<long long int>(buffer.nbytes()) == num_file_bytes);
+ assert((num_file_bytes % _num_threads) == 0);
+
+ if (!_is_valid_parallel_aio_op(true, num_file_bytes)) { return -1; }
+
+ const auto fd = open_file(filename, true);
+ if (fd == -1) { return -1; }
+
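+ // Each worker handles an equal share of the file, i.e. num_file_bytes / _num_threads bytes.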
+ auto scheduled_op = std::make_shared<struct io_op_desc_t>(
+ true, buffer, fd, filename, (num_file_bytes / _num_threads), validate);
+
+ _schedule_aio_work(scheduled_op);
+
+ if (async) { return 0; }
+
+ return wait();
+}
+
+int deepspeed_aio_handle_t::pwrite(const torch::Tensor& buffer,
+ const char* filename,
+ const bool validate,
+ const bool async)
+{
+ const auto num_write_bytes = static_cast<long long int>(buffer.nbytes());
+ assert((num_write_bytes % _num_threads) == 0);
+
+ if (!_is_valid_parallel_aio_op(false, num_write_bytes)) { return -1; }
+
+ const auto fd = open_file(filename, false);
+ if (fd == -1) { return -1; }
+
+ auto scheduled_op = std::make_shared<struct io_op_desc_t>(
+ false, buffer, fd, filename, (num_write_bytes / _num_threads), validate);
+
+ _schedule_aio_work(scheduled_op);
+
+ if (async) { return 0; }
+
+ return wait();
+}
+
+int deepspeed_aio_handle_t::sync_pread(torch::Tensor& buffer, const char* filename)
+{
+ return pread(buffer, filename, false, false);
+}
+
+int deepspeed_aio_handle_t::sync_pwrite(const torch::Tensor& buffer, const char* filename)
+{
+ return pwrite(buffer, filename, false, false);
+}
+
+int deepspeed_aio_handle_t::async_pread(torch::Tensor& buffer, const char* filename)
+{
+ return pread(buffer, filename, false, true);
+}
+
+int deepspeed_aio_handle_t::async_pwrite(const torch::Tensor& buffer, const char* filename)
+{
+ return pwrite(buffer, filename, false, true);
+}
diff --git a/csrc/aio/py_lib/deepspeed_py_aio_handle.h b/csrc/aio/py_lib/deepspeed_py_aio_handle.h
new file mode 100644
index 0000000000000000000000000000000000000000..22de4c3961d29abc94517b81ff38b7224822589c
--- /dev/null
+++ b/csrc/aio/py_lib/deepspeed_py_aio_handle.h
@@ -0,0 +1,68 @@
+/*
+Copyright 2020 The Microsoft DeepSpeed Team
+Licensed under the MIT license.
+
+Functionality for swapping optimizer tensors to/from (NVMe) storage devices.
+*/
+
+#include <condition_variable>